Plan of lectures
Download
Report
Transcript Plan of lectures
Plan of lectures
1. Development of data parallel programming.
Historical view of languages up to HPF.
2. Issues in translation of HPF. Simple
examples. Distributed array descriptors.
3. Communication in data parallel languages.
Communication patterns. Runtime libraries.
4. An “HPspmd” programming model.
Motivations. Introduction to HPJava.
5. Java for HP computing. Java Grande.
mpiJava.
Lecture materials
http://www.npac.syr.edu/projects/pcrc/HPJava/beijing.html
Development of Data-Parallel
Programming
Bryan Carpenter
NPAC at Syracuse University
Syracuse, NY 13244
[email protected]
Goals of this lecture
Review the historical development of
data-parallel programming, up to and
including emergence of High
Performance Fortran.
Illustrate evolution of ideas by
describing some major practical
programming languages for data
parallelism.
Contents of Lecture
Overview.
Early data-parallel languages.
ILLIAC IV CFD and DAP FORTRAN.
Later SIMD languages.
Need for massive parallelism: SIMD, MIMD, SPMD.
Fortran 90
Connection Machine Fortran, related languages.
High Performance Fortran.
Multi-processing and distributed data
Processor arrangements and templates, alignment
Programming examples.
Moore’s law and massive
parallelism
Microprocessors double in speed every
eighteen months.
To outpace growth in power of
sequential computers, parallelism better
offer orders of magnitude better
performance.
Moderate parallelism cannot give this
speedup—massive parallelism or bust!
Niches for massive parallelism
today
Dedicated clusters of commodity
processors. Government labs, internet
servers, Pixar Renderfarm, …
Harvesting cycles of processors
distributed over the internet.
Two principal forms of massive
parallelism: task farming, and data
parallelism.
Data parallelism
Large data structures, typically arrays, split
across computational nodes.
Each node computes local patch of arrays.
Typically need some intermediate results from
other nodes, but communication should not
dominate.
Logical model of fine-grained parallelism—at
level of array elements, rather than tasks.
Programming languages for
massive parallelism
Task farming adequately expressed in
conventional languages
task = function call
libraries handle communication + load-balancing
Data parallelism can be expressed in
conventional languages + MPI (say), but
historically higher level languages have been
quite successful
Potentially avoid the problems of general
concurrent programming (non-determinism,
synchronization, deadlock, …)
Why special languages for
data-parallelism?
Originally devised for Single Instruction,
Multiple Data (SIMD) computers.
Specialized architectures demanded
special language features.
New abstraction: distributed array.
Central control avoided problems of
concurrent programming—concentrate
on expressing parallel algorithms.
Advent of MIMD computers.
By 90s, SIMD lost ground—general purpose
microprocessors too cheap, powerful.
Multiple Instruction, Multiple Data (MIMD)
computers could be built from commodity
components.
Still want to avoid complexity of general
concurrent programming. Single Program,
Multiple Data (SPMD)—now a programming
model rather than computer architecture.
Similarities to SIMD programming suggest
similar language ideas should apply.
The road to HPF
High Performance Fortran took ideas
from SIMD languages—embedded them
in a standardized language for
programming MIMD computers in SPMD
style.
Next section of this lecture will review
historically important SIMD languages—
learn where the ideas came from.
ILLIAC IV
ILLIAC IV development started late 60s. Fully
operational 1975. Illinois then NASA Ames.
SIMD computer for array processing.
Control Unit + 64 Processing elements. 2K
words memory per PE.
CU can access all memory. PEs can access
local memory and communicate with
neighbours.
CU reads program text—broadcasts
instructions to PEs.
Architecture of the ILLIAC IV
ILLIAC IV CFD
Software problematic for ILLIAC.
CFD language developed at Computational
Fluid Dynamics Branch of Ames.
“Fortran-like” rather than a true FORTRAN
dialect.
Deliberately no attempt to hide hardware
peculiarities—on contrary, tried to give
programmer direct access to all features of
the hardware.
Data types in CFD
Basic types included CU REAL, CU INTEGER,
PE REAL, PE INTEGER.
Ordinary arrays can be in PE or CU memory:
CU REAL A, B(100)
PE INTEGER I
PE REAL D(25), E(1000)
Vector-aligned arrays in PE memory:
PE INTEGER J(*)
PE REAL X(*,4), Y(*,2,8)
Vector aligned arrays in CFD
Early language instantiation of Distributed
Array concept.
Only first dimension can be distributed.
Extent of distributed dimension exactly 64.
J(1) stored on first PE, J(2) on second PE, etc.
X(1,1),…,X(1,4) on first PE, X(2,1),…, X(2,4)
on second PE, etc.
Parallel computation in CFD
Parallel computation only in vector
assignments. Subscript is *.
Communication between PEs by adding shift
to *:
DIFP(*) = P(* + 1) – P(* - 1)
Conditional vector assignment:
IF(A(*) .LT. 0) A(*) = -A(*)
PEs can locally access different locations:
DIAG(*) = RHO(*, X(*))
Vector subscript must be in sequential
dimension.
Other ILLIAC languages
Glypnir
Algol-like rather than Fortran-like.
swords (“super-words”) rather than explicit
parallel arrays.
IVTRAN
Higher-level Fortran dialect.
Scheme for mapping general arrays to PEs.
Parallel loop similar to FORALL.
Compiler never fully debugged.
ICL DAP FORTRAN
ICL DAP (Distributed Array Processor)—early
commercial parallel computer. Available early
80s.
Trend to many, simple PEs: 64 × 64 grid of
4096 bit-serial processors.
FP performance much lower than ILLIAC, but
good performance in other domains.
Module of ICL 2900 mainframe computer—
shared memory and other facilities with host.
Programmed in DAP FORTRAN.
Constrained arrays in DAP
FORTRAN
First one or two extents could be omitted:
DIMENSION A(), BB(,)
INTEGER II(,)
REAL AA(,3), BBB(,,5)
A is a vector.
BB and II are matrices.
AA an array of vectors. BBB an array of
matrices.
Extents of constrained dimensions 64 again
(vectors of 4096 also possible).
Array assignments in DAP
Fortran
Leave subscripts empty for array expression.
+ or – subscript used for nearest neighbour
access:
U(,) = 0.25 * (U(,-) + U(,+) + U(-,) + U(+,))
Masked assignment by using logical matrix as
subscript:
LOGICAL L(,)
...
L(,) = BB(,) .LT. 0
B(L) = -B(,)
Standardized array processing:
Fortran 90
By the mid 80s SIMD and vector computers
looked well-established. The Fortran 8X
committee saw a need to make array
assignments standard language features.
Fortran 90 (as it eventually became) added
many new features to support modern
software engineering. It also added new
“array syntax”, including array assignments,
array sections, transformational intrinsics.
Fortran 90 array assignments
Rank-2 arrays, shape (5, 10):
REAL A(5, 10), B(5, 10), C(5, 10)
Array expressions involve conforming arrays:
A + B,
B * C,
A + (B * C)
Arrays conform if they have same shape.
B * C, for example, stands for
B(1,1) * C(1,1) . . .
·
·
B(5,1) * C(5,1) . . .
B(1,10) * C(1,10)
·
B(5,10) * C(5,10)
Scalars in array expressions
Scalars conform with any array:
REAL D
Expression C + D stands for:
C(1,1) + D . . .
·
·
C(5,1) + D . . .
C(1,10) + D
·
C(5,10) + D
Array sections
Array subscripts can be triplets:
X(1:10:3) stands for the subarray:
(X(1), X(4), X(7), X(10))
Special cases of triplets:
X(1:4), stride defaults to 1:
(X(1), X(2), X(3), X(4))
X(:) selects whole of array dimension, as originally
declared. Most useful for multi-dimensional
arrays—here X(:) just equivalent to X.
Array assignments
Simple assignment:
U(2:N-1) = 0.5 * (U(1:N-2) + U(3:N))
Right-hand-side expression must conform
with left-hand-side variable.
Conditional assignment:
WHERE (A .LT. 0.0) A = -A
The condition must be a logical array that
conforms with the left-hand-side of the
assignment.
Array intrinsics
Fortran 90 added many transformational
intrinsics—operations on whole arrays.
CSHIFT, EOSHIFT shift arrays in a dimension.
TRANSPOSE transposes array.
RESHAPE: general reshaping of array.
SUM, PRODUCT, MAXVAL, MINVAL, . . . :
reduction operations that take an array and
return scalar or reduced-rank array.
The Connection Machine
Thinking Machines Corp founded 1983 to
supply connectionist computers for AI.
In practise, their Connection Machine series
was used largely for scientific computing.
CM-2 launched 1986:
SIMD architecture, bit-serial PE like DAP.
Up to 65536 PEs connected on a hypercube
network.
Floating point coprocessors could give peak
performance 28 GFLOPS.
Programmed initially in *LISP, then C*. By
1990 there was a CM Fortran compiler.
CM Fortran
Extended from FORTRAN 77.
Included all the new array syntax of
Fortran 90.
FORALL statement.
Array distribution directives.
Mapping arrays in CM Fortran
Arrays can be allocated on front-end
(VAX, Sun,etc) or on the CM itself.
By default an array is a CM array if it
ever used in a F90 array assignment in
the current procedure; it is a front-end
array if it is only ever used in F77 style.
Distributed arrays (“CM arrays”) can
have any shape.
Layout of CM arrays
Arrays mapped to Virtual Processor sets (VP
sets), one array element per VP.
By default VP set has minimal shape s.t.:
VP grid is large enough to hold the array,
Extents of VP grid are powers of 2,
Total number VPs is a multiple of number PEs.
Map first array element to first VP.
Disable or ignore VPs unneeded to hold
elements.
LAYOUTdirectives in CM
Fortran
Layout directive can override defaults.
Serial dimensions can be elected:
REAL A(10, 64, 100)
CMF$ LAYOUT A(:SERIAL, :NEWS, :NEWS)
A aligned with B:
REAL B(64, 100)
CMF$ LAYOUT B(:NEWS, :NEWS)
All dimensions serial implies front-end array.
LAYOUT directives can appear in procedure
interface blocks.
ALIGN directives in CM Fortran
Can explicitly specify alignment relations:
REAL V(100), B(64, 100)
CMF$ ALIGN V(I) WITH B(1, I)
Offset alignments also allowed:
REAL C(32, 50)
CMF$ ALIGN C(I, J) WITH B(I+5, J+2)
More general alignments, eg transposed, not
allowed.
Layouts and alignments in CM
Fortran
FORALL
Parallelism must be explicit. Eg in F90 style:
U(2:N-1, 2:N-1) =
&
0.25 * (U(2:N-1, 1:N-2) + U(2:N-1, 3:N)
&
+ U(1:N-2, 2:N-1) + U(3:N, 2:N-1))
When array syntax unwieldy, can use FORALL
statement:
FORALL (I = 2:N-1, J = 2:N-1)
&
U(I, J) = 0.25 * (U(I, J-1) + U(I, J+1) +
&
U(I-1, J) + U(I+1, J))
Languages related to CM
Fortran
MasPar Fortran. Contemporary SIMD
computer corporation. Language very similar
to CM Fortran. Slightly different syntax for
directives.
*LISP. Original language of CM. Data
parallel dialect of Common LISP. pvars rather
than distributed arrays.
C*. CM version of C, extended with syntax
for distributed arrays. Introduced shape
concept, similar to templates in later HPF.
The High Performance Fortran
Forum
By early 90s, value of portable, standardized
languages universally acknowledged.
Goal of HPF Forum—a single language for
High Performance programming. Effective
across architectures—vector, SIMD, MIMD,
though SPMD a focus.
Supported by most major vendors then:
Cray, DEC, Fujitsu, HP, IBM, Intel, MasPar, Meiko,
nCube, Sun, Thinking Machines
HPF 1.0 standard published 1993.
Multiprocessing and
Distributed Data
Contemporary parallel computers are built
from autonomous processors with some local
memory. Processors access their local
memory rapidly, or other processors memory
much more slowly. Programs should be
arranged to minimize non-local accesses.
Processors
Memory
areas
Excessive communication
Computation divided into parallel
subcomputations, each computation involving
operands from multiple processors.
Bad load balancing
All parallel subcomputations access
operands on the same processor.
Ideal decomposition
All operands of individual subcomputation
on single processor; operands for different
tasks on different processors.
Placement of data and
computation
HPF allows intricate control over
placement of array elements, through
mapping directives.
HPF 1.0 allows no direct control over
placement of computation. Compiler
should infer good place to compute,
according to location of operands (eg,
“owner computes” heuristic).
Stages of data mapping in
HPF
Processor arrangements
Abstract, program-level representation of
processor set over which data is distributed.
Directive syntax similar to Fortran array
declaration:
!HPF$ PROCESSORS P(10)
Set P contains 10 processors.
Processor arrangements can be multidimensional:
!HPF$ PROCESSORS Q(4, 4)
Templates
Abstract array shape, to which actual data
can be aligned. (Usually much finer grain
than processor arrangements.)
Declaration syntax:
!HPF$ TEMPLATE T(50, 50, 50)
Templates distributed over processor
arrangements using DISTRIBUTE directive.
Following examples assume:
!HPF$ PROCESSORS P1(4)
!HPF$ TEMPLATE T1(17)
Simple block distribution
!HPF$ DISTRIBUTE T1(BLOCK) ONTO P1
Simple cyclic distribution
!HPF$ DISTRIBUTE T1(CYCLIC) ONTO P1
Block distribution with
specified block-size
!HPF$ DISTRIBUTE T1(BLOCK(6)) ONTO P1
Block-cyclic distribution
!HPF$ DISTRIBUTE T1(CYCLIC(3)) ONTO P1
Distributing a
multidimensional template
Distribution formats can be mixed:
!HPF$ PROCESSORS P2(4, 3)
!HPF$ TEMPLATE T2(17, 20)
!HPF$ DISTRIBUTE T2(CYCLIC, BLOCK) ONTO P2
Some dimensions may be serial, or collapsed:
!HPF$ DISTRIBUTE T2(BLOCK, *) ONTO P1
LU Decomposition algorithm
REAL A(N, N)
DO K = 1, N – 1
DO J = K + 1, N
A(K, J) = A(K, J) / A(K, K)
DO I = K + 1, N
A(I, J) = A(I, J) – A(I, K) * A(K, J)
ENDDO
ENDDO
ENDDO
Parallel LU Decomposition
REAL A(N, N)
REAL COL(N), ROW(N)
DO K = 1, N – 1
COL(K:N) = A(K:N, K)
A(K, K+1:N) = A(K, K+1:N) / COL(K)
ROW(K+1:N) = A(K, K+1:N)
FORALL (I = K+1:N, J = K+1:N)
&
A(I, J) = A(I, J) – COL(I) * ROW(J)
ENDDO
Simple align directive
Create template matching principal
array (A), then ALIGN the array to the
template:
!HPF$ TEMPLATE T(N, N)
!HPF$ ALIGN A(I, J) WITH T(I, J)
Aligning auxiliary arrays
Want to minimize communication in largest
parallel loop:
FORALL (I = K+1:N, J = K+1:N)
&
A(I, J) = A(I, J) – COL(I) * ROW(J)
Suggests replicated alignments for COL and
ROW:
!HPF$ ALIGN COL(I) WITH T(I, *)
!HPF$ ALIGN COL(J) WITH T(*, J)
Alignment of arrays in LU
example
Communication in LU example
The statement
COL(K:N) = A(K:N, K)
broadcasts Kth column of A.
The statement
A(K, K+1:N) = A(K, K+1:N) / COL(K)
needs no communication (reason for using
COL(K) rather than A(K, K)).
The statement
ROW(K+1:N) = A(K, K+1:N)
broadcasts Kth row of A.
Distribution in LU example
Block-wise distribution unattractive due to
poor load balancing. Cyclic distribution
preferable:
!HPF$ PROCESSORS P(NP, NP)
!HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P
LU example with directives
!HPF$ PROCESSORS P(NP, NP)
!HPF$ TEMPLATE T(N, N)
!HPF$ DISTRIBUTE T(CYCLIC, CYCLIC) ONTO P
REAL A(N, N)
!HPF$ ALIGN A(I, J) WITH T(I, J)
REAL COL(N), ROW(N)
!HPF$ ALIGN COL(I) WITH T(I, *)
!HPF$ ALIGN COL(J) WITH T(*, J)
DO K = 1, N – 1
COL(K:N) = A(K:N, K)
A(K, K+1:N) = A(K, K+1:N) / COL(K)
ROW(K+1:N) = A(K, K+1:N)
FORALL (I = K+1:N, J = K+1:N)
&
A(I, J) = A(I, J) – COL(I) * ROW(J)
ENDDO
Other alignment options
Permuted dimensions:
DIMENSION B(N, N)
!HPF$ ALIGN B(I, J) WITH T(J, I)
Affine intra-dimensional alignment:
DIMENSION C(N/2, N/2)
!HPF$ ALIGN C(I, J) WITH T(N/2 + I, 2*J)
Intra-dimensional alignment
example
Next Lecture:
Issues in translation of High
Performance Fortran
Translation of simple HPF programs to
SPMD (MPI) programs.
Design of a runtime Distributed Array
Descriptor.