A Multi-platform Co-array Fortran Compiler for High-Performance Computing
John Mellor-Crummey, Yuri Dotsenko, Cristian Coarfa
• SPMD process images
– number of images fixed during execution
– images operate asynchronously
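A minimal sketch of the SPMD image model described above, written in standard Fortran 2008 coarray syntax rather than the exact dialect accepted by the compiler on this poster. The program name is made up for illustration; `this_image()` and `num_images()` are standard intrinsics.

```fortran
program images_demo
  implicit none
  integer :: me, np

  me = this_image()   ! index of this image, in the range 1..np
  np = num_images()   ! total number of images, fixed for the whole run

  ! Every image executes this same program on its own data (SPMD).
  ! Images run asynchronously, so the order of the output lines is not defined.
  print '(a,i0,a,i0)', 'image ', me, ' of ', np

  sync all            ! explicit barrier: images only meet where the code asks
end program images_demo
```

With gfortran this should build as a single-image program using `-fcoarray=single`; a multi-image run needs a coarray runtime such as OpenCoarrays.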
Simple and expressive models for high-performance programming
based on extensions to widely used languages
• Performance: users control data and computation partitioning
• Portability: same language for SMPs, MPPs, and clusters
• Programmability: global address space for simplicity
• Simple one-sided shared memory communication
copy rows from p:p+2 into local
• Pointers and dynamic allocation
A sensible alternative to these extremes
me = this_image()
• Synchronization strength-reduction
• Communication vectorization
• Platform-driven communication optimizations
. . .
! ghost cell update
a(1:N,N+1)[left(me)] = a(1:N,0)
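A self-contained sketch of the ghost cell update above, in Fortran 2008 coarray syntax. Assumptions not in the original: the tile size `n`, a periodic ring of images, a scalar `left` in place of the poster's `left(me)`, and the two `sync all` barriers that make the one-sided PUT safe to read.

```fortran
program ghost_cells
  implicit none
  integer, parameter :: n = 4        ! assumed tile size
  real(8) :: a(0:n+1, 0:n+1)[*]      ! tile plus a one-cell halo, as declared on the poster
  integer :: me, np, left

  me = this_image()
  np = num_images()
  left = me - 1                      ! assumed periodic ring of images
  if (left < 1) left = np

  a = real(me, 8)                    ! fill the tile with this image's index
  sync all                           ! all images finish initializing before any PUT

  ! ghost cell update: one-sided PUT into the left neighbour's halo column
  a(1:n, n+1)[left] = a(1:n, 0)

  sync all                           ! all PUTs have landed before any halo is read

  ! column n+1 now holds the data written by the image to the right
  print '(a,i0,a,f4.1)', 'image ', me, ' halo value ', a(1, n+1)
end program ghost_cells
```

The barriers are the simplest correct choice, not the cheapest; replacing them with weaker point-to-point synchronization is the kind of strength reduction listed among the optimizations.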
• Transform one-sided communication into two-sided and collective communication where useful
• Generate both fine-grain load/store and calls to communication
libraries as necessary
• Multi-model code for hierarchical architectures
• Convert Gets into Puts
• Parallel I/O
integer:: handle
real(8), pointer:: ptr(:,:)
integer A(10,10)[*]
• The compiler is responsible for
data locality and communication
• Annotated sequential code (semiautomatic parallelization)
• Using MPI can be difficult and error prone
• Most of the burden for communication
• Requires heroic compiler technology
• The model limits the application paradigms:
extensions to the standard are required for
supporting irregular computation
• Source-to-source code generation for wide portability
type(CafHandleReal8) a_caf
• Open source compiler will be available
. . .
[Figure: image 1, image 2, ..., image N, each holding its own copy of A(10,10)]
if (me .eq. 1) then
A(1:3,1:5)[me+1] = A(1:3,1:5)[me]
! pack the source column into a temporary buffer
allocate( cafBuffer_1%ptr(1:N,0:0) )
cafBuffer_1%ptr = a_caf%ptr(1:N,0:0)
! describe the destination section of the remote co-array
cafBuffer_2%ptr => a_caf%ptr(1:N,N+1:N+1)
call CafArmciPutS(a_caf%handle, left(me), &
                  cafBuffer_1, cafBuffer_2)
deallocate( cafBuffer_1%ptr )
• Working prototype for core language features
• Current compiler implementation performs no optimization
– each co-array access is transformed into a get/put operation at the same point in the code
• Code generation uses the widely portable ARMCI
communication library
• Front-end based on production-quality Open64 front end,
modified to support source-to-source compilation
. . .
Performance Results on IA64+Myrinet 2000
Implementation Status
end type
communication and data locality
• Compiler-directed parallel I/O with UIUC
• Interoperability with other parallel programming models
type CafHandleReal8
HPF
optimization falls on application developers;
compiler support is underutilized
real(8) a(0:N+1,0:N+1)[*]
. . .
• Flexible synchronization
• Enhancements to Co-Array Fortran model
• Point-to-point one-way synchronization
• Hints for matching synchronization events
• Collective operation intrinsics
• Split-phase primitives
. . .
• Both private and shared data
– real a(20,20)       private: a 20x20 array in each image
– real a(20,20)[*]    shared: a 20x20 array in each image
– x(:,j:j+2) = a(r,:)[p:p+2]    copy rows from p:p+2 into local columns
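A sketch of the private/shared distinction and the multi-image read above, in Fortran 2008 coarray syntax. Standard Fortran 2008 requires scalar cosubscripts, so the co-array section `[p:p+2]` is written here as an explicit loop over images. The values of `r`, `j`, and `p` and the program name are illustrative assumptions.

```fortran
program gather_rows
  implicit none
  integer, parameter :: n = 20
  real :: a(n, n)[*]      ! shared: one 20x20 co-array per image, remotely addressable
  real :: x(n, n)         ! private: ordinary local array, one per image
  integer :: me, np, p, j, r, k

  me = this_image()
  np = num_images()
  a = real(me)            ! mark each image's data with its index
  x = 0.0
  sync all                ! remote data must be initialized before any GET

  r = 1; j = 1; p = 1     ! assumed row, first local column, first source image
  if (me == 1 .and. np >= p + 2) then
     do k = 0, 2
        ! one-sided GET: row r of image p+k becomes local column j+k
        x(:, j + k) = a(r, :)[p + k]
     end do
     print '(3f5.1)', x(1, j:j+2)
  end if
end program gather_rows
```

The source images take no part in the transfer beyond the initial barrier, which is what makes the communication one-sided.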
PUT Translation Example
– sync_team(team [,wait])
• team = a vector of process ids to synchronize with
• wait = a vector of processes to wait for (a subset of team)
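`sync_team` belongs to the Co-Array Fortran dialect described here. As a rough illustration of synchronizing with a team instead of with every image, the sketch below uses `sync images`, the closest Fortran 2008 standard statement. It is an analogue, not the same primitive: `sync images` has no counterpart to the `wait` argument. The ring of neighbours and all names are assumptions.

```fortran
program neighbour_sync
  implicit none
  integer :: inbox[*]                ! one integer per image, written by a neighbour
  integer :: me, np, left, right
  integer, allocatable :: team(:)

  me = this_image()
  np = num_images()
  left  = merge(np, me - 1, me == 1)   ! assumed periodic ring
  right = merge(1, me + 1, me == np)

  ! team = the images this image synchronizes with (no repeated entries allowed)
  if (np == 1) then
     allocate(team(0))
  else if (np == 2) then
     team = [left]                     ! left and right are the same image
  else
     team = [left, right]
  end if

  inbox[right] = me                    ! one-sided PUT to the right neighbour

  ! synchronize with the neighbours only, not with all images
  if (size(team) > 0) sync images(team)

  print '(a,i0,a,i0)', 'image ', me, ' received ', inbox
end program neighbour_sync
```

Each image waits only for its two neighbours, so distant images are free to run ahead; a full barrier would couple all of them.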
Co-Array Fortran
• Portable and widely used
• The programmer has explicit control over
Research Focus
Co-Array Fortran Language
Programming Models
for High-Performance Computing
MPI
{johnmc, dotsenko, ccristi}@cs.rice.edu
same point in the code
Performance Results on Alpha+Quadrics *
* For NAS BT and CG the base case is synthetic, so that the first measurable point has efficiency 1.0
Performance Results on SGI Altix 3000 **
** Preliminary results on a loaded system in the presence of other users competing for the memory bandwidth