Plan of lectures

1. Development of data-parallel programming. Historical view of languages up to HPF.
2. Issues in translation of HPF. Simple examples. Distributed array descriptors.
3. Communication in data-parallel languages. Communication patterns. Runtime libraries.
4. An "HPspmd" programming model. Motivations. Introduction to HPJava.
5. Java for HP computing. Java Grande. mpiJava.

Lecture materials:
http://www.npac.syr.edu/projects/pcrc/HPJava/beijing.html
Development of Data-Parallel Programming
Bryan Carpenter
NPAC at Syracuse University
Syracuse, NY 13244
[email protected]
Goals of this lecture

Review the historical development of data-parallel programming, up to and including the emergence of High Performance Fortran.

Illustrate the evolution of ideas by describing some major practical programming languages for data parallelism.
Contents of Lecture

Overview.
  Need for massive parallelism: SIMD, MIMD, SPMD.
Early data-parallel languages.
  ILLIAC IV CFD and DAP FORTRAN.
Later SIMD languages.
  Fortran 90.
  Connection Machine Fortran, related languages.
High Performance Fortran.
  Multi-processing and distributed data.
  Processor arrangements and templates, alignment.
  Programming examples.
Moore's law and massive parallelism

Microprocessors double in speed every eighteen months.

To outpace the growth in power of sequential computers, parallelism had better offer orders of magnitude better performance.

Moderate parallelism cannot give this speedup—massive parallelism or bust!
Niches for massive parallelism today

Dedicated clusters of commodity processors. Government labs, internet servers, Pixar Renderfarm, …

Harvesting cycles of processors distributed over the internet.

Two principal forms of massive parallelism: task farming, and data parallelism.
Data parallelism

Large data structures, typically arrays, split across computational nodes.

Each node computes a local patch of the arrays.

Typically some intermediate results are needed from other nodes, but communication should not dominate.

Logical model of fine-grained parallelism—at the level of array elements, rather than tasks.
Programming languages for massive parallelism

Task farming is adequately expressed in conventional languages:
  task = function call
  libraries handle communication + load-balancing

Data parallelism can be expressed in conventional languages + MPI (say), but historically higher-level languages have been quite successful (a minimal MPI sketch follows this list):
  they potentially avoid the problems of general concurrent programming (non-determinism, synchronization, deadlock, …)
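As an illustration of the "conventional language + MPI" route, here is a minimal sketch of a one-dimensional data-parallel update in Fortran with MPI: each process owns a patch of the array and exchanges boundary ("ghost") elements with its neighbours. This example is mine, not from the lecture; the patch size NLOC and the ring topology are illustrative assumptions.

  PROGRAM GHOST
    INCLUDE 'mpif.h'
    INTEGER, PARAMETER :: NLOC = 100
    REAL :: U(0:NLOC+1), UNEW(NLOC)
    INTEGER :: RANK, NPROCS, LEFT, RIGHT, IERR
    INTEGER :: STATUS(MPI_STATUS_SIZE)

    CALL MPI_INIT(IERR)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)

    ! Neighbours on a ring of processes (illustrative topology)
    LEFT  = MOD(RANK - 1 + NPROCS, NPROCS)
    RIGHT = MOD(RANK + 1, NPROCS)

    ! Each process owns a patch of the global array
    U = REAL(RANK)

    ! Exchange ghost cells with both neighbours
    CALL MPI_SENDRECV(U(NLOC), 1, MPI_REAL, RIGHT, 0,     &
                      U(0), 1, MPI_REAL, LEFT, 0,         &
                      MPI_COMM_WORLD, STATUS, IERR)
    CALL MPI_SENDRECV(U(1), 1, MPI_REAL, LEFT, 1,         &
                      U(NLOC+1), 1, MPI_REAL, RIGHT, 1,   &
                      MPI_COMM_WORLD, STATUS, IERR)

    ! Purely local data-parallel update on the patch
    UNEW(1:NLOC) = 0.5 * (U(0:NLOC-1) + U(2:NLOC+1))

    CALL MPI_FINALIZE(IERR)
  END PROGRAM GHOST

Note how the communication (the two MPI_SENDRECV calls) is written by hand; the higher-level languages discussed below generate this kind of code automatically.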
Why special languages for data-parallelism?

Originally devised for Single Instruction, Multiple Data (SIMD) computers.

Specialized architectures demanded special language features.

New abstraction: the distributed array.

Central control avoided the problems of concurrent programming—the programmer concentrates on expressing parallel algorithms.
Advent of MIMD computers

By the 90s, SIMD lost ground—general-purpose microprocessors were too cheap and powerful.

Multiple Instruction, Multiple Data (MIMD) computers could be built from commodity components.

Still want to avoid the complexity of general concurrent programming. Single Program, Multiple Data (SPMD)—now a programming model rather than a computer architecture.

Similarities to SIMD programming suggest similar language ideas should apply.
The road to HPF

High Performance Fortran took ideas from SIMD languages—embedded them in a standardized language for programming MIMD computers in SPMD style.

The next section of this lecture reviews historically important SIMD languages—to learn where the ideas came from.
ILLIAC IV

ILLIAC IV development started in the late 60s. Fully operational 1975. Illinois, then NASA Ames.

SIMD computer for array processing.

Control Unit (CU) + 64 Processing Elements (PEs). 2K words of memory per PE.

The CU can access all memory. PEs can access local memory and communicate with neighbours.

The CU reads the program text—broadcasts instructions to the PEs.
Architecture of the ILLIAC IV
ILLIAC IV CFD

Software was problematic for the ILLIAC.

The CFD language was developed at the Computational Fluid Dynamics Branch of Ames.

"Fortran-like" rather than a true FORTRAN dialect.

Deliberately no attempt to hide hardware peculiarities—on the contrary, it tried to give the programmer direct access to all features of the hardware.
Data types in CFD

Basic types included CU REAL, CU INTEGER, PE REAL, PE INTEGER.

Ordinary arrays can be in PE or CU memory:

  CU REAL A, B(100)
  PE INTEGER I
  PE REAL D(25), E(1000)

Vector-aligned arrays in PE memory:

  PE INTEGER J(*)
  PE REAL X(*,4), Y(*,2,8)
Vector-aligned arrays in CFD

An early language instantiation of the distributed array concept.

Only the first dimension can be distributed.

The extent of the distributed dimension is exactly 64.

J(1) is stored on the first PE, J(2) on the second PE, etc.

X(1,1),…,X(1,4) on the first PE, X(2,1),…,X(2,4) on the second PE, etc.
Parallel computation in CFD

Parallel computation appears only in vector assignments. The subscript is *.

Communication between PEs by adding a shift to *:

  DIFP(*) = P(* + 1) - P(* - 1)

Conditional vector assignment:

  IF(A(*) .LT. 0) A(*) = -A(*)

PEs can locally access different locations:

  DIAG(*) = RHO(*, X(*))

The vector subscript must be in a sequential dimension.
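For comparison with what comes later, the shifted vector assignment above can be sketched in Fortran 90 array syntax (my analogue, not part of CFD; bounds are padded so the shifted sections conform):

  REAL P(0:65), DIFP(64)
  ! CFD's P(*+1) and P(*-1) become shifted array sections
  DIFP(1:64) = P(2:65) - P(0:63)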
Other ILLIAC languages

Glypnir
  Algol-like rather than Fortran-like.
  swords ("super-words") rather than explicit parallel arrays.

IVTRAN
  Higher-level Fortran dialect.
  Scheme for mapping general arrays to PEs.
  Parallel loop similar to FORALL.
  Compiler never fully debugged.
ICL DAP FORTRAN

ICL DAP (Distributed Array Processor)—an early commercial parallel computer, available in the early 80s.

Trend to many, simple PEs: a 64 × 64 grid of 4096 bit-serial processors.

FP performance much lower than the ILLIAC, but good performance in other domains.

A module of the ICL 2900 mainframe computer—shared memory and other facilities with the host.

Programmed in DAP FORTRAN.
Constrained arrays in DAP FORTRAN

The first one or two extents could be omitted:

  DIMENSION A(), BB(,)
  INTEGER II(,)
  REAL AA(,3), BBB(,,5)

A is a vector. BB and II are matrices. AA is an array of vectors; BBB an array of matrices.

Extents of constrained dimensions are 64 again (vectors of 4096 also possible).
Array assignments in DAP FORTRAN

Leave subscripts empty for an array expression. A + or - subscript gives nearest-neighbour access:

  U(,) = 0.25 * (U(,-) + U(,+) + U(-,) + U(+,))

Masked assignment by using a logical matrix as subscript:

  LOGICAL L(,)
  ...
  L(,) = BB(,) .LT. 0
  BB(L) = -BB(,)
Standardized array processing: Fortran 90

By the mid 80s SIMD and vector computers looked well-established. The Fortran 8X committee saw a need to make array assignments standard language features.

Fortran 90 (as it eventually became) added many new features to support modern software engineering. It also added new "array syntax", including array assignments, array sections, and transformational intrinsics.
Fortran 90 array assignments

Rank-2 arrays, shape (5, 10):

  REAL A(5, 10), B(5, 10), C(5, 10)

Array expressions involve conforming arrays:

  A + B,  B * C,  A + (B * C)

Arrays conform if they have the same shape. B * C, for example, stands for the elementwise array

  B(1,1)*C(1,1)   ...   B(1,10)*C(1,10)
       .                       .
       .                       .
  B(5,1)*C(5,1)   ...   B(5,10)*C(5,10)
Scalars in array expressions

Scalars conform with any array:

  REAL D

The expression C + D stands for

  C(1,1)+D   ...   C(1,10)+D
      .                .
      .                .
  C(5,1)+D   ...   C(5,10)+D
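A small, self-contained Fortran 90 program illustrating conformance and scalar broadcast (my example, with illustrative values):

  PROGRAM CONFORM
    REAL :: A(5, 10), B(5, 10), C(5, 10), D

    B = 2.0                 ! scalar broadcast into a whole array
    C = 3.0
    D = 1.0

    A = B * C               ! elementwise product of conforming arrays
    A = A + (B * C) + D     ! scalars conform with any array

    PRINT *, A(1, 1), A(5, 10)   ! both print 13.0
  END PROGRAM CONFORM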
Array sections

Array subscripts can be triplets:

  X(1:10:3) stands for the subarray (X(1), X(4), X(7), X(10))

Special cases of triplets:

  X(1:4), stride defaults to 1: (X(1), X(2), X(3), X(4))

  X(:) selects the whole of an array dimension, as originally declared. Most useful for multi-dimensional arrays—here X(:) is just equivalent to X.
Array assignments

Simple assignment:

  U(2:N-1) = 0.5 * (U(1:N-2) + U(3:N))

The right-hand-side expression must conform with the left-hand-side variable.

Conditional assignment:

  WHERE (A .LT. 0.0) A = -A

The condition must be a logical array that conforms with the left-hand side of the assignment.
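Putting sections and WHERE together in a complete, runnable sketch (mine; the array length N = 8 and data are illustrative):

  PROGRAM SECTIONS
    INTEGER, PARAMETER :: N = 8
    REAL :: U(N) = (/ 3.0, -1.0, 4.0, -1.0, 5.0, -9.0, 2.0, -6.0 /)

    ! Three-point smoothing on the interior, expressed with sections
    U(2:N-1) = 0.5 * (U(1:N-2) + U(3:N))

    ! Masked assignment: take absolute values elementwise
    WHERE (U .LT. 0.0) U = -U

    PRINT *, U
  END PROGRAM SECTIONS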
Array intrinsics

Fortran 90 added many transformational intrinsics—operations on whole arrays.

CSHIFT, EOSHIFT shift arrays in a dimension.

TRANSPOSE transposes an array.

RESHAPE: general reshaping of an array.

SUM, PRODUCT, MAXVAL, MINVAL, …: reduction operations that take an array and return a scalar or a reduced-rank array.
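For instance, a circular-shift difference and two reductions (my sketch):

  PROGRAM INTRINSICS
    REAL :: U(8), DIFF(8)
    INTEGER :: I

    U = (/ (REAL(I), I = 1, 8) /)

    ! Circular-shift form of a nearest-neighbour difference
    DIFF = CSHIFT(U, 1) - CSHIFT(U, -1)

    PRINT *, SUM(U)         ! 36.0
    PRINT *, MAXVAL(DIFF)   ! reduction to a scalar
  END PROGRAM INTRINSICS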
The Connection Machine

Thinking Machines Corp was founded in 1983 to supply connectionist computers for AI.

In practice, their Connection Machine series was used largely for scientific computing.

CM-2 launched 1986:
  SIMD architecture, bit-serial PEs like the DAP.
  Up to 65536 PEs connected in a hypercube network.
  Floating point coprocessors could give a peak performance of 28 GFLOPS.

Programmed initially in *LISP, then C*. By 1990 there was a CM Fortran compiler.
CM Fortran

Extended from FORTRAN 77.
Included all the new array syntax of Fortran 90.
FORALL statement.
Array distribution directives.
Mapping arrays in CM Fortran

Arrays can be allocated on the front-end (VAX, Sun, etc.) or on the CM itself.

By default an array is a CM array if it is ever used in an F90 array assignment in the current procedure; it is a front-end array if it is only ever used in F77 style.

Distributed arrays ("CM arrays") can have any shape.
Layout of CM arrays

Arrays are mapped to Virtual Processor sets (VP sets), one array element per VP.

By default the VP set has the minimal shape such that:
  the VP grid is large enough to hold the array,
  extents of the VP grid are powers of 2,
  the total number of VPs is a multiple of the number of PEs.

Map the first array element to the first VP. Disable or ignore VPs not needed to hold elements.
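As a worked illustration (my own numbers, assuming a 16K-PE machine): for REAL A(500, 500), the extents are rounded up to powers of 2, giving a 512 × 512 VP grid; its 262144 VPs are a multiple of the 16384 PEs (16 VPs per PE), and the VPs outside the 500 × 500 region are disabled.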
LAYOUT directives in CM Fortran

The LAYOUT directive can override the defaults.

Serial dimensions can be selected:

  REAL A(10, 64, 100)
  CMF$ LAYOUT A(:SERIAL, :NEWS, :NEWS)

A aligned with B:

  REAL B(64, 100)
  CMF$ LAYOUT B(:NEWS, :NEWS)

All dimensions serial implies a front-end array.

LAYOUT directives can appear in procedure interface blocks.
ALIGN directives in CM Fortran

Alignment relations can be specified explicitly:

  REAL V(100), B(64, 100)
  CMF$ ALIGN V(I) WITH B(1, I)

Offset alignments are also allowed:

  REAL C(32, 50)
  CMF$ ALIGN C(I, J) WITH B(I+5, J+2)

More general alignments, e.g. transposed, are not allowed.
Layouts and alignments in CM Fortran
FORALL

Parallelism must be explicit. E.g. in F90 style:

  U(2:N-1, 2:N-1) =                                     &
      0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N)           &
            + U(1:N-2, 2:N-1) + U(3:N, 2:N-1))

When array syntax is unwieldy, the FORALL statement can be used:

  FORALL (I = 2:N-1, J = 2:N-1)                         &
      U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1)           &
                      + U(I-1, J) + U(I+1, J))
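A complete Fortran sketch of the relaxation step (mine; FORALL became standard in Fortran 95, and N = 6 is just an illustrative size):

  PROGRAM RELAX
    INTEGER, PARAMETER :: N = 6
    REAL :: U(N, N)
    INTEGER :: I, J

    U = 0.0
    U(1, :) = 1.0       ! fixed boundary values on one edge

    ! One four-point relaxation sweep over the interior;
    ! all right-hand sides are evaluated before any assignment
    FORALL (I = 2:N-1, J = 2:N-1)                       &
        U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1)         &
                        + U(I-1, J) + U(I+1, J))

    PRINT *, U(2, 2)
  END PROGRAM RELAX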
Languages related to CM Fortran

MasPar Fortran. From a contemporary SIMD computer corporation. Language very similar to CM Fortran; slightly different syntax for directives.

*LISP. The original language of the CM. A data-parallel dialect of Common LISP. pvars rather than distributed arrays.

C*. The CM version of C, extended with syntax for distributed arrays. Introduced the shape concept, similar to templates in later HPF.
The High Performance Fortran Forum

By the early 90s, the value of portable, standardized languages was universally acknowledged.

Goal of the HPF Forum—a single language for High Performance programming, effective across architectures—vector, SIMD, MIMD—though SPMD was a focus.

Supported by most major vendors of the time:
  Cray, DEC, Fujitsu, HP, IBM, Intel, MasPar, Meiko, nCube, Sun, Thinking Machines

HPF 1.0 standard published 1993.
Multiprocessing and Distributed Data

Contemporary parallel computers are built from autonomous processors with some local memory. Processors access their own local memory rapidly, but other processors' memory much more slowly. Programs should be arranged to minimize non-local accesses.

[Figure: processors and their associated memory areas]
Excessive communication

Computation divided into parallel subcomputations, each involving operands from multiple processors.
Bad load balancing

All parallel subcomputations access operands on the same processor.
Ideal decomposition

All operands of an individual subcomputation on a single processor; operands for different tasks on different processors.
Placement of data and computation

HPF allows intricate control over the placement of array elements, through mapping directives.

HPF 1.0 allows no direct control over the placement of computation. The compiler should infer a good place to compute, according to the location of operands (e.g., the "owner computes" heuristic).
Stages of data mapping in HPF
Processor arrangements

Abstract, program-level representation of the processor set over which data is distributed.

Directive syntax is similar to a Fortran array declaration:

  !HPF$ PROCESSORS P(10)

The set P contains 10 processors. Processor arrangements can be multidimensional:

  !HPF$ PROCESSORS Q(4, 4)
Templates

An abstract array shape, to which actual data can be aligned. (Usually much finer grain than processor arrangements.)

Declaration syntax:

  !HPF$ TEMPLATE T(50, 50, 50)

Templates are distributed over processor arrangements using the DISTRIBUTE directive. The following examples assume:

  !HPF$ PROCESSORS P1(4)
  !HPF$ TEMPLATE T1(17)
Simple block distribution

  !HPF$ DISTRIBUTE T1(BLOCK) ONTO P1

Simple cyclic distribution

  !HPF$ DISTRIBUTE T1(CYCLIC) ONTO P1

Block distribution with specified block-size

  !HPF$ DISTRIBUTE T1(BLOCK(6)) ONTO P1

Block-cyclic distribution

  !HPF$ DISTRIBUTE T1(CYCLIC(3)) ONTO P1
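To make the four formats concrete, the following sketch (mine, following the definitions above) computes which processor of P1(4) owns each element of T1(17), with the resulting ownership tabulated in the comments:

  ! Which processor of P1(NP) owns template element I?  For T1(17) on P1(4):
  !   BLOCK     -> blocks of CEILING(17/4)=5: procs own 1-5, 6-10, 11-15, 16-17
  !   CYCLIC    -> element I goes to processor MOD(I-1, 4)+1
  !   BLOCK(6)  -> procs own 1-6, 7-12, 13-17; the last proc is empty
  !   CYCLIC(3) -> blocks of 3 dealt round-robin: proc 1 owns 1-3 and 13-15, etc.
  PROGRAM OWNERS
    INTEGER, PARAMETER :: N = 17, NP = 4
    INTEGER :: I

    DO I = 1, N
      PRINT *, I, OWNER_BLOCK(I), OWNER_CYCLIC(I), OWNER_BC(I, 3)
    END DO

  CONTAINS

    INTEGER FUNCTION OWNER_BLOCK(I)    ! BLOCK: default block size CEILING(N/NP)
      INTEGER, INTENT(IN) :: I
      OWNER_BLOCK = (I - 1) / ((N + NP - 1) / NP) + 1
    END FUNCTION OWNER_BLOCK

    INTEGER FUNCTION OWNER_CYCLIC(I)   ! CYCLIC: deal elements round-robin
      INTEGER, INTENT(IN) :: I
      OWNER_CYCLIC = MOD(I - 1, NP) + 1
    END FUNCTION OWNER_CYCLIC

    INTEGER FUNCTION OWNER_BC(I, B)    ! CYCLIC(B): deal blocks of size B
      INTEGER, INTENT(IN) :: I, B
      OWNER_BC = MOD((I - 1) / B, NP) + 1
    END FUNCTION OWNER_BC
  END PROGRAM OWNERS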
Distributing a multidimensional template

Distribution formats can be mixed:

  !HPF$ PROCESSORS P2(4, 3)
  !HPF$ TEMPLATE T2(17, 20)
  !HPF$ DISTRIBUTE T2(CYCLIC, BLOCK) ONTO P2

Some dimensions may be serial, or collapsed:

  !HPF$ DISTRIBUTE T2(BLOCK, *) ONTO P1
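In the mixed example above (my working): the first dimension of T2 is dealt cyclically over the 4 rows of P2, so row index I goes to processor row MOD(I-1, 4)+1, while the second dimension is divided into blocks of CEILING(20/3) = 7, so the three processor columns own template columns 1-7, 8-14 and 15-20 respectively.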
LU Decomposition algorithm

  REAL A(N, N)
  DO K = 1, N - 1
    DO J = K + 1, N
      A(K, J) = A(K, J) / A(K, K)
      DO I = K + 1, N
        A(I, J) = A(I, J) - A(I, K) * A(K, J)
      ENDDO
    ENDDO
  ENDDO
Parallel LU Decomposition

  REAL A(N, N)
  REAL COL(N), ROW(N)
  DO K = 1, N - 1
    COL(K:N) = A(K:N, K)
    A(K, K+1:N) = A(K, K+1:N) / COL(K)
    ROW(K+1:N) = A(K, K+1:N)
    FORALL (I = K+1:N, J = K+1:N)                       &
        A(I, J) = A(I, J) - COL(I) * ROW(J)
  ENDDO
Simple align directive

Create a template matching the principal array (A), then ALIGN the array to the template:

  !HPF$ TEMPLATE T(N, N)
  !HPF$ ALIGN A(I, J) WITH T(I, J)
Aligning auxiliary arrays

Want to minimize communication in the largest parallel loop:

  FORALL (I = K+1:N, J = K+1:N)                         &
      A(I, J) = A(I, J) - COL(I) * ROW(J)

This suggests replicated alignments for COL and ROW:

  !HPF$ ALIGN COL(I) WITH T(I, *)
  !HPF$ ALIGN ROW(J) WITH T(*, J)
Alignment of arrays in LU example
Communication in LU example

The statement

  COL(K:N) = A(K:N, K)

broadcasts the Kth column of A.

The statement

  A(K, K+1:N) = A(K, K+1:N) / COL(K)

needs no communication (the reason for using COL(K) rather than A(K, K)).

The statement

  ROW(K+1:N) = A(K, K+1:N)

broadcasts the Kth row of A.
Distribution in LU example

Block-wise distribution is unattractive due to poor load balancing (the active part of A shrinks as K grows). Cyclic distribution is preferable:

  !HPF$ PROCESSORS P(NP, NP)
  !HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P
LU example with directives

  !HPF$ PROCESSORS P(NP, NP)
  !HPF$ TEMPLATE T(N, N)
  !HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P
        REAL A(N, N)
  !HPF$ ALIGN A(I, J) WITH T(I, J)
        REAL COL(N), ROW(N)
  !HPF$ ALIGN COL(I) WITH T(I, *)
  !HPF$ ALIGN ROW(J) WITH T(*, J)
        DO K = 1, N - 1
          COL(K:N) = A(K:N, K)
          A(K, K+1:N) = A(K, K+1:N) / COL(K)
          ROW(K+1:N) = A(K, K+1:N)
          FORALL (I = K+1:N, J = K+1:N)                 &
              A(I, J) = A(I, J) - COL(I) * ROW(J)
        ENDDO
Other alignment options

Permuted dimensions:

  DIMENSION B(N, N)
  !HPF$ ALIGN B(I, J) WITH T(J, I)

Affine intra-dimensional alignment:

  DIMENSION C(N/2, N/2)
  !HPF$ ALIGN C(I, J) WITH T(N/2 + I, 2*J)
Intra-dimensional alignment example
Next Lecture: Issues in translation of High Performance Fortran

Translation of simple HPF programs to SPMD (MPI) programs.

Design of a runtime Distributed Array Descriptor.