Transcript Document

MATLAB HPCS Extensions
Presented by: David Padua
University of Illinois at Urbana-Champaign
PERCS Program Review
Contributors
• Gheorghe Almasi - IBM Research
• Calin Cascaval - IBM Research
• Siddhartha Chatterjee - IBM Research
• Basilio Fraguela - University of Illinois
• Jose Moreira - IBM Research
• David Padua - University of Illinois
Objectives
• To develop MATLAB extensions for
accessing, prototyping, and implementing
scalable parallel algorithms.
• As a result, to give programmers of high-end machines
access to all the powerful features of MATLAB:
– Array operations / kernels.
– Interactive interface.
– Rendering.
Uses of the MATLAB Extension
• Interface for parallel library users.
• Interface for parallel library developers.
• Input to a “conventional” compiler.
• Input to a linear algebra compiler.
– A library generator/tuner for parallel machines.
– Leverage NSF-ITR project with K. Pingali
(Cornell) and J. DeJong (Illinois).
Design Requirements
• Minimal extension.
• A natural extension to MATLAB that is easy to
use.
• Extensions for direct control of parallelism and
communication, on top of the ability to access
parallel library routines. It does not seem possible
to encapsulate all the important parallelism in
library routines.
• Extensions that provide the necessary information
and can be automatically and effectively analyzed
for compilation and translation.
The Design
• No existing MATLAB extension had the
characteristics we needed.
• We designed a data type that we call
hierarchically tiled arrays (HTAs). These
are arrays whose components can themselves be
arrays or other HTAs. Operators on HTAs
represent computations or communication.
Approach
• In our approach, the programmer interacts
with a copy of MATLAB running on a
workstation.
• The workstation controls parallel
computation on servers.
Approach (Cont.)
• All conventional MATLAB operations are
executed on the workstation.
• The parallel server operates on the HTAs.
• The HTA type is implemented as a
MATLAB toolbox. This enables
implementation as a language extension and
simplifies porting to future versions of
MATLAB.
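The toolbox approach works because a toolbox can define a new class and overload its indexing and arithmetic operators, so HTA code reads like ordinary MATLAB without modifying the interpreter. A minimal Python analogue of that idea (illustrative only; the `Tiled` class and its methods are invented for this sketch and are not the HTA toolbox API):

```python
import numpy as np

# Sketch: a user-defined tiled-array type whose operators are overloaded,
# analogous to how a MATLAB toolbox class can extend the language.
class Tiled:
    def __init__(self, tiles):
        self.tiles = tiles  # 2-D list of NumPy tiles

    def __getitem__(self, ij):  # t[i, j] returns tile (i, j)
        i, j = ij
        return self.tiles[i][j]

    def __add__(self, other):   # elementwise addition, tile by tile
        return Tiled([[a + b for a, b in zip(ra, rb)]
                      for ra, rb in zip(self.tiles, other.tiles)])

t = Tiled([[np.ones((2, 2)), np.zeros((2, 2))],
           [np.zeros((2, 2)), np.ones((2, 2))]])
u = t + t
print(u[0, 0][0, 0])  # -> 2.0
```

Because the new type lives entirely in user code, porting to a new MATLAB version only requires the class to keep working, not interpreter changes.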
Interpretation and Compilation
• A first implementation based on the MATLAB
interpreter has been developed.
• This implementation will be used to improve our
understanding of the extensions and will serve as a
basis for further development and tuning.
• Interpretation overhead may hinder performance,
but parallelism can compensate for the overhead.
• Future work will include the implementation of a
compiler for MATLAB and our extensions based
on the effective strategies of L. DeRose and G.
Almasi.
From: G. Almasi and D. Padua, MaJIC: Compiling MATLAB for Speed and Responsiveness, PLDI 2002.
Hierarchically Tiled Arrays
• Array tiles are a powerful mechanism to enhance
locality in sequential computations and to
represent data distribution across parallel systems.
• Several levels of tiling are useful to distribute data
across parallel machines with a hierarchical
organization and to simultaneously represent both
data distribution and memory layout.
• For example, a two-level hierarchy of tiles can be
used to represent:
– the data distribution on a parallel system and
– the memory layout within each component.
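The two-level idea can be made concrete with a small NumPy illustration (not the HTA toolbox itself; the reshape/transpose trick below is just one way to expose tiles): an 8 x 8 array viewed as a 2 x 2 grid of 4 x 4 tiles, where the outer level could model distribution across processors and the inner level each processor's local layout.

```python
import numpy as np

n, q = 8, 4  # 8x8 array, 4x4 tiles
flat = np.arange(n * n).reshape(n, n)

# Expose the tile structure: tiled[I, J] is the 4x4 tile in tile-row I,
# tile-column J of the original array.
tiled = flat.reshape(n // q, q, n // q, q).transpose(0, 2, 1, 3)

print(tiled.shape)        # (2, 2, 4, 4)
print(tiled[0, 1][0, 0])  # first element of tile {1,2} == flat[0, 4] == 4
```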
Hierarchically Tiled Arrays (Cont.)
• Computation and communication are
represented as array operations on HTAs.
• Using array operations for communication
and computation raises the level of
abstraction and, at the same time, facilitates
optimization.
Using HTAs for Locality Enhancement
Tiled matrix multiplication using conventional arrays
for I=1:q:n
  for J=1:q:n
    for K=1:q:n
      for i=I:I+q-1
        for j=J:J+q-1
          for k=K:K+q-1
            C(i,j) = C(i,j) + A(i,k)*B(k,j);
          end
        end
      end
    end
  end
end
Using HTAs for Locality Enhancement
Tiled matrix multiplication using HTAs
• Here, C{i,j}, A{i,k}, B{k,j} represent submatrices.
• The * operator represents matrix multiplication in MATLAB.

for i=1:m
  for j=1:m
    for k=1:m
      C{i,j} = C{i,j} + A{i,k}*B{k,j};
    end
  end
end
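The HTA loop above can be mimicked with ordinary data structures; a hedged NumPy sketch (the dict-of-tiles representation and the `flatten` helper are inventions of this sketch, not the HTA implementation) that checks the tiled product against the untiled one:

```python
import numpy as np

m, q = 3, 4  # 3x3 grid of q-by-q tiles
rng = np.random.default_rng(0)
A = {(i, k): rng.random((q, q)) for i in range(m) for k in range(m)}
B = {(k, j): rng.random((q, q)) for k in range(m) for j in range(m)}
C = {(i, j): np.zeros((q, q)) for i in range(m) for j in range(m)}

# C{i,j} = C{i,j} + A{i,k}*B{k,j}, with * a per-tile matrix multiply
for i in range(m):
    for j in range(m):
        for k in range(m):
            C[i, j] += A[i, k] @ B[k, j]

# The flattened result matches the untiled product.
flatten = lambda T: np.block([[T[i, j] for j in range(m)] for i in range(m)])
assert np.allclose(flatten(C), flatten(A) @ flatten(B))
```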
Using HTAs to Represent Data Distribution and Parallelism
Cannon’s Algorithm
[Figure: four snapshots of a 4 x 4 grid of tile pairs A{i,j}, B{k,j} illustrating Cannon’s algorithm: the initial layout, the skewed alignment, and successive circular shifts of the A tiles left along rows and the B tiles up along columns.]
Cannon’s Algorithm in MATLAB with HPCS Extensions

C{1:n,1:n} = zeros(p,p);            %communication
…
for k=1:n
  C{:,:} = C{:,:} + A{:,:}*B{:,:};  %computation
  A{i,1:n} = A{i,[2:n, 1]};         %communication
  B{1:n,i} = B{[2:n,1],i};          %communication
end
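The algorithm can be simulated serially to check its correctness; a NumPy sketch (illustrative only: the tile grid is a 4-D array, and `np.roll` stands in for the circular-shift communication; all names are chosen for this sketch):

```python
import numpy as np

n, p = 4, 3  # n-by-n grid of p-by-p tiles
rng = np.random.default_rng(1)
A = rng.random((n, n, p, p))  # A[i, j] is tile {i+1, j+1}
B = rng.random((n, n, p, p))
C = np.zeros((n, n, p, p))

# Initial skew: shift row i of A left by i, column j of B up by j.
As = np.stack([np.roll(A[i], -i, axis=0) for i in range(n)])
Bs = np.stack([np.roll(B[:, j], -j, axis=0) for j in range(n)], axis=1)

for _ in range(n):
    for i in range(n):
        for j in range(n):
            C[i, j] += As[i, j] @ Bs[i, j]  # local tile multiply
    As = np.roll(As, -1, axis=1)  # shift A tiles left along rows
    Bs = np.roll(Bs, -1, axis=0)  # shift B tiles up along columns

# Flattened result equals the ordinary matrix product.
flat = lambda T: np.block([[T[i, j] for j in range(n)] for i in range(n)])
assert np.allclose(flat(C), flat(A) @ flat(B))
```

After the skew, tile (i,j) always holds a pair A{i,k}, B{k,j} with matching k, so n multiply-and-shift steps accumulate the full product with only nearest-neighbor communication.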
Cannon’s Algorithm in C + MPI

for (km = 0; km < m; km++) {
  char *chn = "T";
  dgemm(chn, chn, lclMxSz, lclMxSz, lclMxSz, 1.0, a, lclMxSz, b, lclMxSz, 1.0, c, lclMxSz);
  MPI_Isend(a, lclMxSz * lclMxSz, MPI_DOUBLE, destrow, ROW_SHIFT_TAG, MPI_COMM_WORLD, &requestrow);
  MPI_Isend(b, lclMxSz * lclMxSz, MPI_DOUBLE, destcol, COL_SHIFT_TAG, MPI_COMM_WORLD, &requestcol);
  MPI_Recv(abuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE, ROW_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Recv(bbuf, lclMxSz * lclMxSz, MPI_DOUBLE, MPI_ANY_SOURCE, COL_SHIFT_TAG, MPI_COMM_WORLD, &status);
  MPI_Wait(&requestrow, &status);
  aptr = a; a = abuf; abuf = aptr;  /* swap a with the received buffer */
  MPI_Wait(&requestcol, &status);
  bptr = b; b = bbuf; bbuf = bptr;  /* swap b with the received buffer */
}
Speedups on a four-processor IBM SP-2 [figure]
Speedups on a nine-processor IBM SP-2 [figure]
Flattening
• Elements of an HTA are referenced using a tile index for
each level in the hierarchy, followed by an array index.
Each tile index tuple is enclosed in braces ({}) and the array
index is enclosed in parentheses.
– In the matrix multiplication code, C{i,j}(3,4) would represent
element (3,4) of submatrix (i,j).
• Alternatively, the tiled array can be accessed as a flat
array, as shown in the next slide.
• This feature is useful when the algorithm needs a global
view of the array. It is also useful when transforming
sequential code into parallel form.
Two Ways of Referencing the Elements of an 8 x 8 Array

With 2 x 2 tiles of size 4 x 4, each element has a tiled reference C{I,J}(i,j) and an equivalent flat reference C(r,c), where r = 4(I-1)+i and c = 4(J-1)+j. For example:

  C{1,1}(1,1) = C(1,1)    C{1,2}(1,1) = C(1,5)
  C{1,1}(4,4) = C(4,4)    C{1,2}(4,4) = C(4,8)
  C{2,1}(1,1) = C(5,1)    C{2,2}(1,1) = C(5,5)
  C{2,1}(4,4) = C(8,4)    C{2,2}(4,4) = C(8,8)
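The tiled-to-flat correspondence is a simple index computation; a small Python sketch (the `flat_index` helper is invented here for illustration, using 1-based indices as in MATLAB):

```python
def flat_index(I, J, i, j, q=4):
    """Map a tiled reference C{I,J}(i,j) with q-by-q tiles to the
    flat (row, col) reference C(r, c), all indices 1-based."""
    return (q * (I - 1) + i, q * (J - 1) + j)

print(flat_index(1, 2, 1, 1))  # -> (1, 5): C{1,2}(1,1) is C(1,5)
print(flat_index(2, 2, 4, 4))  # -> (8, 8): C{2,2}(4,4) is C(8,8)
```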
Status
• We have completed the implementation of
practically all of our initial language
extensions (for the IBM SP-2 and for Linux
clusters).
• Following the toolbox approach has been a
challenge, but we have been able to
overcome all obstacles.
Conclusions
• We have developed parallel extensions to
MATLAB.
• It is possible to write highly readable
parallel code for both dense and sparse
computations with these extensions.
• The HTA objects and operations have been
implemented as a MATLAB toolbox, which
enabled their introduction as language
extensions.