Using Hierarchically Tiled Arrays for Locality


Programming with Tiles
Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, María J. Garzarán, David Padua
University of Illinois at Urbana-Champaign
*IBM, India
+Universidade da Coruña, Spain
Motivation
- The importance of tiles
  - A natural way to express many algorithms
  - Partitioning data is an effective way to enhance locality
  - Pervasive in parallel computing
- Limitations of today's programming languages
  - Lack of constructs to express tiles directly
  - Complicated indexing and loop structures to traverse an array by tiles
  - No support for storage by tile
  - The algorithm design and its detailed implementation end up mixed together
Contributions
- Our group developed Hierarchically Tiled Arrays (HTAs) to support tiles that enhance locality and parallelism (PPoPP 2006)
- Designed new language constructs based on our experience:
  - Dynamic partitioning
  - Overlapped tiling
- Evaluated both productivity and performance
Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions
HTA overview
[Figure: HTA indexing examples. A(0,0) selects a tile; A(0,1)[0,1] selects an element within a tile; A(1,0:1) selects a range of tiles; A(0,0:1) = B(1,0:1) assigns tiles between HTAs; binary operations such as A + B apply an op tile by tile. A stand-alone sketch of this two-level indexing follows.]
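Since the overview figure survives only as the index expressions above, here is a small self-contained C++ sketch (my illustration, not the HTA library; the class TinyTiledArray and its methods are invented for this example) of the two-level tile/element indexing that the figure shows:

#include <cstdio>
#include <vector>

// Hypothetical illustration of two-level (tile, element) indexing.
// A 1D array of nTiles tiles, each with tileSize elements,
// stored tile by tile so that each tile is contiguous in memory.
struct TinyTiledArray {
    int nTiles, tileSize;
    std::vector<double> data;   // tile-major storage

    TinyTiledArray(int t, int s) : nTiles(t), tileSize(s), data(t * s, 0.0) {}

    // Tile-level access: pointer to the start of tile i.
    double* tile(int i) { return &data[i * tileSize]; }

    // Element access within a tile, mirroring A(i)[j] in HTA notation.
    double& at(int i, int j) { return data[i * tileSize + j]; }
};

int main() {
    TinyTiledArray A(2, 4), B(2, 4);
    for (int i = 0; i < 8; ++i) B.data[i] = i;

    A.at(0, 1) = 42.0;                       // like A(0)[1] = 42

    // Tile-level assignment, like A(0) = B(1): one contiguous copy.
    for (int j = 0; j < A.tileSize; ++j) A.tile(0)[j] = B.tile(1)[j];

    // Element-wise op applied tile by tile, like A + B.
    for (int i = 0; i < A.nTiles; ++i)
        for (int j = 0; j < A.tileSize; ++j)
            A.at(i, j) += B.at(i, j);

    std::printf("A(0)[1] = %g\n", A.at(0, 1));
    return 0;
}

Storing each tile contiguously is what makes tile-level assignment a single contiguous copy, which is the locality benefit the talk attributes to storage by tile.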
HTA library implementation
- Implementations in MATLAB and in C++
- Supports sequential and parallel execution, on top of MPI and TBB
- Supports linear and non-linear data layouts
- Execution model: SPMD
- Programming model: single-threaded
Why dynamic partitioning?
- Some linear algebra algorithms add and remove partitions to sweep through matrices
- Cache-oblivious algorithms create partitions following a divide-and-conquer strategy
Syntax and semantics
- Partitions are delimited by fixed partition lines; NONE is used for dimensions that are not partitioned
- Two methods (a toy model of both follows):
  void part( Tuple sourcePartition, Tuple offset );
  void rmPart( Tuple partition );
[Figure: A is a 3 x 3 tiled array; A.part((1,1),(2,2)) inserts a new partition line in each dimension at offset (2,2) from partition line (1,1), giving 4 x 4 tiles; A.rmPart((1,1)) removes partition line (1,1), restoring the original tiling.]
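To make part/rmPart concrete, the following is a toy C++ model (hypothetical, not the HTA implementation) of one dimension's partition lines as a sorted list of offsets; part inserts a line at a given offset from an existing line and rmPart deletes a line:

#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical model of dynamic partitioning along one dimension.
// Partition line i sits at offset lines[i]; tiles are the gaps between lines.
struct Dim {
    std::vector<int> lines;   // sorted absolute offsets, e.g. {0, 3, 6, 9}

    // part(src, off): add a new partition line off elements
    // after existing partition line src.
    void part(int src, int off) {
        int pos = lines[src] + off;
        lines.insert(std::upper_bound(lines.begin(), lines.end(), pos), pos);
    }

    // rmPart(i): remove partition line i, merging the tiles around it.
    void rmPart(int i) { lines.erase(lines.begin() + i); }

    void print() const {
        for (int p : lines) std::printf("%d ", p);
        std::printf("\n");
    }
};

int main() {
    Dim d{{0, 3, 6, 9}};   // three tiles of size 3
    d.part(1, 2);          // new line 2 elements after line 1 -> 0 3 5 6 9
    d.print();
    d.rmPart(2);           // remove it again -> 0 3 6 9
    d.print();
    return 0;
}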
LU algorithm with dynamic partitioning
[Figure: one iteration of blocked LU with dynamic partitioning. At the beginning of an iteration the leading part of the matrix is done and the rest is partially updated.]

Repartition: A.part((1,1), (nb,nb)) exposes the blocks A11 (nb x nb), A12, A21, A22.

Update:
  P1 [ A11 ; A21 ] = [ L11 ; L21 ] U11   (panel factorization)
  ...
  A22 = A22 - L21 U12                    (trailing-matrix update)

End of iteration: A.rmPart((1,1)) removes the partition lines; the done region has grown by nb.
LU algorithm represented in FLAME [van de Geijn et al.]
[Figure: the FLAME algorithm for blocked LU, showing the blocks updated (UPD) inside the loop.]
From algorithm to HTA program
The FLAME algorithm maps to the HTA implementation below (index expressions such as A(1:2,1), taken from the slide, denote tile ranges):

void lu(HTA<double,2> A, HTA<int,1> p, int nb)
{
  A.part((0,0), (0,0));                 // start with an empty "done" partition
  p.part((0), (0));
  while (A(0,0).lsize(1) < A.lsize(1)) {
    int b = min(A(1,1).lsize(0), nb);
    A.part((1,1), (b,b));               // expose A11, A12, A21, A22
    p.part((1), (b));
    dgetf2(A(1:2,1), p(1));             // factor the panel [A11; A21]
    dlaswp(A(1:2,0), p(1));             // apply row pivots to the left ...
    dlaswp(A(1:2,2), p(1));             // ... and to the right of the panel
    trsm(HtaRight, HtaUpper, HtaNoTrans,
         HtaUnit, One, A(1,1), A(1,2)); // triangular solve with the diagonal block
    gemm(HtaNoTrans, HtaNoTrans, MinusOne,
         A(2,1), A(1,2), One, A(2,2));  // A22 = A22 - A21 * A12
    A.rmPart((1,1));                    // grow the "done" region
    p.rmPart((1));
  }
}
FLAME API vs. HTA
A.part((0,0), (0,0));
A.part((1,1), (b,b));
A.rmPart((1,1));
A(1:2, 1);
Advantages of HTA: (1) general; (2) fewer variables; (3) simple indexing; (4) flexible range selection.
A cache-oblivious algorithm: parallel merge
1. Take the middle element of in1
2. Find its lower bound in in2
3. Calculate the partition of out
4. Partition (logically)
5. Merge in parallel (see the sketch below)
[Figure: example with in1 = {3 4 11 14 22 30 32 41} and in2 = {2 4 10 15 18 23 28 40}. The middle element of in1 is 22 and its lower bound in in2 is 23, so the inputs are logically split into {3 4 11 14 | 22 30 32 41} and {2 4 10 15 18 | 23 28 40}, and the two halves are merged in parallel.]
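A minimal sequential C++ sketch of the divide-and-conquer merge just described (my illustration, not the talk's HTA or TBB code); the two recursive calls write disjoint output ranges, so they are the natural spawn points for parallel tasks:

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Cache-oblivious divide-and-conquer merge of sorted ranges
// [a, aEnd) and [b, bEnd) into out.
void coMerge(const int* a, const int* aEnd,
             const int* b, const int* bEnd, int* out) {
    if (aEnd - a < bEnd - b) {              // always recurse on the larger input
        std::swap(a, b); std::swap(aEnd, bEnd);
    }
    if (a == aEnd) return;                  // both ranges empty
    if (aEnd - a == 1) {                    // base case: tiny inputs
        std::merge(a, aEnd, b, bEnd, out);
        return;
    }
    const int* mid = a + (aEnd - a) / 2;               // 1. middle element
    const int* lb = std::lower_bound(b, bEnd, *mid);   // 2. lower bound in the other input
    int* outMid = out + (mid - a) + (lb - b);          // 3. partition point of out
    coMerge(a, mid, b, lb, out);            // 4.+5. merge the halves; these two
    coMerge(mid, aEnd, lb, bEnd, outMid);   //        calls could run in parallel
}

int main() {
    std::vector<int> in1 = {3, 4, 11, 14, 22, 30, 32, 41};
    std::vector<int> in2 = {2, 4, 10, 15, 18, 23, 28, 40};
    std::vector<int> out(in1.size() + in2.size());
    coMerge(in1.data(), in1.data() + in1.size(),
            in2.data(), in2.data() + in2.size(), out.data());
    for (int v : out) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}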
HTA vs. Threading Building Blocks (TBB)
HTA code:
  map(PMerge(), out, in1, in2);
TBB code:
  parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody());
HTA: partition the tiles and operate on the tiles in parallel.
TBB: prepare the merge range and operate on smaller ranges in parallel.
Evaluation

Benchmark       Category                    Execution mode
MMM             Cache-oblivious algorithm   Sequential
Recursive LU    Cache-oblivious algorithm   Sequential
Dynamic LU      FLAME's algorithm           Sequential
Sylvester       FLAME's algorithm           Sequential
Parallel merge  Cache-oblivious algorithm   Parallel

Platforms: the sequential benchmarks ran on a 3.0 GHz Intel Pentium 4 (16 KB L1, 1 MB L2, 1 GB RAM); the parallel benchmark ran on two quad-core 2.66 GHz Xeon processors.
[Figure: sequential performance of recursive MMM, dynamic LU, recursive LU, and Sylvester.]

[Figure: parallel merge performance.]
Motivation
- Programs whose computations are based on neighboring points
  - E.g. iterative PDE solvers
  - Benefit from tiling
  - Need shadow regions
- Problems with the current approach
  - Explicit allocation and update of shadow regions
  - Example: NAS MG (lines of code)

Versions  comm3  Residual  Inversion  Projection
MPI         327        23         24          49
CAF         302        25         23          55
OpenMP       26        27         26          59
HTA          33        55         53          64
Overlapped tiling
- Objective: automate the allocation and update of shadow regions
- Allow access to neighbors across neighboring tiles
- Allow the programmer to specify the overlapped region at creation time:

Overlap<DIM> ol = Overlap<DIM>(Tuple negativeDir, Tuple positiveDir,
                               BoundaryMode mode, bool autoUpdate=true);
Example of overlapped tiling
[Figure: a 1D HTA of two tiles (sizes 4 and 3) with an overlap of one element in each direction; each tile's shadow region overlaps its neighbor's owned data, and the boundary is filled according to the boundary mode (here, zeros).]

Tuple::seq tiling = ((4), (3));
Overlap<DIM> ol((1), (1), zero);
A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);
Indexing and operations
- Indexing (see the pointer sketch below)
  - T[0:3] = T[ALL] selects the owned region
  - T[-1] and T[4] access the shadow regions on either side
- Operations (+, -, map, =, etc.)
  - The conformability rules only apply to owned regions
  - This enables operations with non-overlapped HTAs
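The owned/shadow indexing can be pictured with plain pointer offsets; in this stand-alone C++ sketch (my illustration, not HTA code) a tile owns four elements and keeps one halo cell on each side, so T[-1] and T[4] land on the shadow cells exactly as in the slide's notation:

#include <cstdio>

int main() {
    // 4 owned elements plus one halo (shadow) cell on each side.
    double storage[6] = {0};
    double* T = storage + 1;   // T[0..3] is the owned region

    for (int i = 0; i < 4; ++i) T[i] = i + 1;  // fill the owned region
    T[-1] = 99.0;   // left shadow cell (would mirror the left neighbor's T[3])
    T[4]  = 77.0;   // right shadow cell (would mirror the right neighbor's T[0])

    // A stencil over the owned region can now read across the tile edge
    // without bounds checks.
    for (int i = 0; i < 4; ++i)
        std::printf("%g ", 0.5 * (T[i - 1] + T[i + 1]));
    std::printf("\n");
    return 0;
}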
Shadow region consistency
- Ensure that shadow regions are properly updated and consistent with the corresponding data in the owned tiles
- Use an update-on-read policy (sketched below)
  - Bookkeeping of the status of each tile
  - SPMD mode: no communication is needed
- Allow both manual and automatic updates
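One way to picture update-on-read (a speculative sketch of the policy, not the library's actual bookkeeping): each tile records whether its shadow copy is stale, and the refresh is deferred until a read actually needs it:

#include <cstdio>

// Hypothetical update-on-read bookkeeping for one tile's shadow region.
struct Tile {
    double owned[4] = {1, 2, 3, 4};
    double shadowRight = 0;      // copy of the right neighbor's first element
    bool shadowDirty = true;     // set whenever the neighbor's data changes

    double readShadow(const Tile& rightNeighbor) {
        if (shadowDirty) {                       // refresh only when read
            shadowRight = rightNeighbor.owned[0];
            shadowDirty = false;                 // bookkeeping: now up to date
        }
        return shadowRight;
    }
};

int main() {
    Tile a, b;
    b.owned[0] = 42;
    std::printf("%g\n", a.readShadow(b));  // triggers the deferred update: 42
    std::printf("%g\n", a.readShadow(b));  // no copy this time: still 42
    return 0;
}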
Evaluation

Benchmark             Description                          Execution mode
Sequential 3D Jacobi  Stencil computation on 6 neighbors   Sequential
NAS MG                Multigrid V-cycle algorithm          Parallel
NAS LU                Navier-Stokes equation solver        Parallel

Platform: a cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected by Myrinet; one processor per node was used. The NAS codes (Fortran + MPI) were compiled with g77 and the HTA codes with g++ 3.3; the -O3 flag was used in both cases.
MG comm3 in HTA, with and without overlapped tiling

With overlapped tiling, the whole exchange reduces to declaring the overlap:

Overlap<3>* ol = new Overlap<3>(T<3>(1,1,1), T<3>(1,1,1), PERIODIC);

Without overlapped tiling, the shadow-region exchange must be written by hand:

void comm3(Grid& u)
{
  int NX = u.shape().size()[0] - 1;
  int NY = u.shape().size()[1] - 1;
  int NZ = u.shape().size()[2] - 1;
  int nx = u(T(0,0,0)).shape().size()[0] - 1;
  int ny = u(T(0,0,0)).shape().size()[1] - 1;
  int nz = u(T(0,0,0)).shape().size()[2] - 1;

  // north-south
  Traits::Default::async();
  if (NX > 0)
    u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
        u((R(1,NX), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx,nx), R(1,ny-1), R(1,nz-1))) =
      u((R(0,0), R(0,NY), R(0,NZ)))((R(1,1), R(1,ny-1), R(1,nz-1)));
  if (NX > 0)
    u((R(1,NX), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
        u((R(0,NX-1), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  u((R(0,0), R(0,NY), R(0,NZ)))((R(0,0), R(1,ny-1), R(1,nz-1))) =
      u((R(NX,NX), R(0,NY), R(0,NZ)))((R(nx-1,nx-1), R(1,ny-1), R(1,nz-1)));
  Traits::Default::sync();

  // east-west
  Traits::Default::async();
  if (NY > 0)
    u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
        u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny,ny), R(1,nz-1))) =
      u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(1,1), R(1,nz-1)));
  if (NY > 0)
    u((R(0,NX), R(1,NY), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
        u((R(0,NX), R(0,NY-1), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  u((R(0,NX), R(0,0), R(0,NZ)))((R(0,nx), R(0,0), R(1,nz-1))) =
      u((R(0,NX), R(NY,NY), R(0,NZ)))((R(0,nx), R(ny-1,ny-1), R(1,nz-1)));
  Traits::Default::sync();

  // front-back
  Traits::Default::async();
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz,nz))) =
        u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(1,1)));
  u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz,nz))) =
      u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(1,1)));
  if (NZ > 0)
    u((R(0,NX), R(0,NY), R(1,NZ)))((R(0,nx), R(0,ny), R(0,0))) =
        u((R(0,NX), R(0,NY), R(0,NZ-1)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  u((R(0,NX), R(0,NY), R(0,0)))((R(0,nx), R(0,ny), R(0,0))) =
      u((R(0,NX), R(0,NY), R(NZ,NZ)))((R(0,nx), R(0,ny), R(nz-1,nz-1)));
  Traits::Default::sync();
}
MG comm3 in NAS (Fortran + MPI)
[Code excerpt, interleaved beyond recovery in the transcript: the Fortran + MPI version spreads the shadow-region exchange across subroutines comm3, give3, take3, and ready, with explicit message buffers, mpi_send/mpi_irecv calls, and per-axis, per-direction copy loops; about 327 lines in total (see the line-count table on the earlier Motivation slide).]
[Figure: performance of 3D Jacobi, NAS LU (class B), NAS MG (class C), and NAS LU (class C).]
Three metrics
- Programming effort [Halstead, 1977]
  - Program volume V: a function of the number of distinct operators and operands and their total number of occurrences
  - Potential volume V*: the volume of the program's most succinct form, in a language that has the required operations already defined or implemented
  - Effort: E = V^2 / V* (written out below)
- Program complexity [McCabe, 1976]
  - C = P + 1, where P is the number of decision points in the program
- Source lines of code, L
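For reference, the metrics can be written out as follows; the volume formula is Halstead's standard definition and is an addition here, not something stated on the slide. With eta_1 distinct operators, eta_2 distinct operands, and N_1, N_2 total occurrences of each:

\[
\eta = \eta_1 + \eta_2, \qquad N = N_1 + N_2, \qquad V = N \log_2 \eta,
\qquad E = \frac{V^2}{V^*}, \qquad C = P + 1.
\]

One consequence worth noting: halving a program's volume at a fixed potential volume cuts the estimated effort by a factor of four.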
Evaluation

Programs          Effort   Complexity  LOC
staticLU HTA       61,545           3   10
staticLU NDS      208,074           8   49
staticLU LAPACK   160,509           6   37
dynamicLU HTA      51,599           1   13
dynamicLU FLAME   170,477           1   52
recLU HTA          85,530           1   18
recLU ATLAS       186,891          10   40
Sylvester HTA     423,404           2   47
Sylvester FLAME   700,629           5   95
Related work
- FLAME API [Bientinesi et al., 2005]: ad-hoc notations
- Sequoia [Fatahalian et al., 2006]: principal construct is the task
- HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998]: tiles are used for distribution; do not address the different levels of the memory hierarchy
- POOMA [Reynders, 1996]: tiles and shadow regions are accessed as a whole
- Global Arrays [Nieplocha et al., 2006]: SPMD programming model
Conclusion
- HTA makes tiles part of a language
- It provides a generalized framework to express tiles
- It increases productivity
  - Less index calculation; fewer variables and loops; simpler function interfaces
  - Dynamic partitioning
  - Overlapped tiling
- Little performance degradation
Code example: 2D Jacobi
(without overlapped tiling vs. with overlapped tiling; a stand-in sketch follows)
- HTA takes care of the allocation of shadow regions and of data consistency
- Clean indexing syntax
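The slide's code listings are not in the transcript; as a stand-in, here is a minimal plain-C++ sketch of the 2D Jacobi update with an explicit halo, i.e. the bookkeeping a programmer writes by hand when overlapped tiling is not available (everything here is my illustration, not HTA code):

#include <cstdio>
#include <utility>
#include <vector>

// Plain 2D Jacobi sweep over an N x N interior with a halo of width 1.
int main() {
    const int N = 8, W = N + 2;              // W includes the halo
    std::vector<double> u(W * W, 0.0), v(W * W, 0.0);
    auto at = [W](std::vector<double>& g, int i, int j) -> double& {
        return g[i * W + j];
    };

    // Boundary condition on the top edge of both buffers.
    for (int j = 0; j < W; ++j) { at(u, 0, j) = 1.0; at(v, 0, j) = 1.0; }

    for (int iter = 0; iter < 100; ++iter) {
        // In a tiled or distributed code, each tile's shadow region would be
        // refreshed from its neighbors here; with overlapped tiling the
        // library does this automatically (e.g. on read).
        for (int i = 1; i <= N; ++i)
            for (int j = 1; j <= N; ++j)
                at(v, i, j) = 0.25 * (at(u, i-1, j) + at(u, i+1, j) +
                                      at(u, i, j-1) + at(u, i, j+1));
        std::swap(u, v);
    }
    std::printf("u(1,1) = %g\n", at(u, 1, 1));
    return 0;
}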
Two types of stencil computation
- Concurrent computation
  - Each tile can be executed independently
  - Examples: Jacobi, MG
- Wavefront computation
  - The execution sequence of the tiles follows a certain order
  - Examples: LU, SSOR
- With the update-on-read policy, a minimal number of communications is achieved
[Figure: in a concurrent computation the shadow regions are updated in every iteration; in a wavefront computation the shadow-region update happens in the second iteration.]