Transcript Document

A Multi-platform
Co-Array Fortran Compiler
Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey
Department of Computer Science
Rice University
Houston, TX USA
1
Motivation
Parallel Programming Models
• MPI: de facto standard
– difficult to program
• OpenMP: inefficient to map onto distributed-memory platforms
– lack of locality control
• HPF: hard to obtain high performance
– heroic compilers needed!
Global address space languages (CAF, Titanium, UPC) offer
an appealing middle ground
2
Co-Array Fortran
• Global address space programming model
– one-sided communication (GET/PUT)
• Programmer has control over performance-critical factors
– data distribution
– computation partitioning
– communication placement
• Data movement and synchronization as language primitives
– amenable to compiler-based communication optimization
3
CAF Programming Model Features
• SPMD process images
– fixed number of images during execution
– images operate asynchronously
• Both private and shared data
– real x(20, 20)
a private 20x20 array in each image
– real y(20, 20)[*]
a shared 20x20 array in each image
• Simple one-sided shared-memory communication
– x(:,j:j+2) = y(:,p:p+2)[r] copy columns from image r into local columns
• Synchronization intrinsic functions
– sync_all – a barrier and a memory fence
– sync_mem – a memory fence
– sync_team([team members to notify], [team members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation
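Taken together, these features already support a small SPMD kernel. Below is a minimal illustrative sketch (not from the talk) that combines a private array, a co-array, a one-sided GET, and barrier synchronization:

  program caf_features_sketch
    real :: x(20,20)            ! private: an independent copy on every image
    real :: y(20,20)[*]         ! co-array: remotely accessible copy on every image
    integer :: r

    y = real(this_image())      ! initialize local co-array data
    call sync_all()             ! barrier + memory fence before any remote reads

    r = this_image() + 1
    if (r <= num_images()) then
       x(:,1:3) = y(:,1:3)[r]   ! one-sided GET: fetch three columns from image r
    end if

    call sync_all()
  end program caf_features_sketch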
4
One-sided Communication with Co-Arrays
integer a(10,20)[*]

[Figure: one instance of the co-array a(10,20) on each of images 1, 2, …, N]

if (this_image() > 1) &
   a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

[Figure: each image except image 1 copies the last two columns of its left
neighbor's co-array into its own first two columns]
5
Rice Co-Array Fortran Compiler (cafc)
• First CAF multi-platform compiler
– previous compiler targeted only Cray shared-memory systems
• Implements the core of the language
– currently lacks support for derived-type and dynamically allocated co-arrays
• Core sufficient for non-trivial codes
• Performance comparable to that of hand-tuned MPI codes
• Open source
6
Outline
• CAF programming model
• cafc
– Core language implementation
– Optimizations
• Experimental evaluation
• Conclusions
7
Implementation Strategy
• Source-to-source compilation of CAF codes
– uses Open64/SL Fortran 90 infrastructure
– CAF → Fortran 90 + communication operations
• Communication
– ARMCI library for one-sided communication on clusters
– load/store communication on shared-memory platforms
• Goals
– portability
– high performance on a wide range of platforms
8
Co-Array Descriptors
real :: a(10,10,10)[*]

type CAFDesc_real_3
   integer(ptrkind) :: handle      ! opaque handle to the CAF runtime representation
   real, pointer    :: ptr(:,:,:)  ! Fortran 90 pointer to local co-array data
end type CAFDesc_real_3

type(CAFDesc_real_3) :: a
• Initialize and manipulate Fortran 90 dope vectors
9
Allocating COMMON and SAVE Co-Arrays
• Compiler
– generates static initializer for each common/save variable
• Linker
– collects calls to all initializers
– generates global initializer that calls all others
– compiles global initializer and links into program
• Launch
– invokes global initializer before main program begins
• allocates co-array storage outside Fortran 90 runtime system
• associates co-array descriptors with allocated memory
Similar to handling for C++ static constructors
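The effect of a generated initializer can be pictured with a small self-contained sketch; the module, type contents, and routine below are illustrative stand-ins, not cafc's actual runtime interface:

  module caf_static_init_sketch
    type CAFDesc_real_1
       integer :: handle                ! opaque handle into the communication runtime
       real, pointer :: ptr(:)          ! Fortran 90 pointer to the local co-array data
    end type CAFDesc_real_1
    real, allocatable, target :: storage(:)   ! stands in for memory obtained from the runtime
  contains
    subroutine init_c(desc)             ! plays the role of a generated static initializer
       type(CAFDesc_real_1), intent(out) :: desc
       allocate(storage(100))           ! co-array storage allocated outside the F90 runtime
       desc%handle = 0                  ! in cafc this would record the runtime's handle
       desc%ptr => storage              ! associate the descriptor with the allocated memory
    end subroutine init_c
  end module caf_static_init_sketch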
10
Parameter Passing
• Call-by-value convention (copy-in, copy-out)
call f((a(I)[p]))
– pass remote co-array data to procedures only as values
• Call-by-co-array convention*
– argument declared as a co-array by callee
subroutine f(a)
real :: a(10)[*]
– enables access to local and remote co-array data
• Call-by-reference convention* (cafc)
– argument declared as an explicit shape array
– enables access to local co-array data only
real :: x(10)[*]
call f(x)
subroutine f(a)
real :: a(10)
– enables reuse of existing Fortran code
* requires an explicit interface
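For the call-by-co-array convention, the required explicit interface is most easily obtained by placing the callee in a module. An illustrative sketch (not from the talk):

  module shift_mod
  contains
    subroutine shift_from_left(a)
       real :: a(10)[*]                  ! dummy argument declared as a co-array
       if (this_image() > 1) &
          a(1) = a(10)[this_image()-1]   ! remote access through the dummy argument
    end subroutine shift_from_left
  end module shift_mod

  program use_shift
    use shift_mod                        ! module use supplies the explicit interface
    real :: x(10)[*]
    x = real(this_image())
    call sync_all()                      ! make every image's data visible before remote reads
    call shift_from_left(x)
    call sync_all()
  end program use_shift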
11
Multiple Co-dimensions
Managing processors as a logical multi-dimensional grid
integer a(10,10)[5,4,*]
3D processor grid 5 x 4 x …
• Support co-space reshaping at procedure calls
– change number of co-dimensions
– co-space bounds as procedure arguments
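An illustrative sketch (not from the talk) of a 3-D co-space and of passing co-space bounds as procedure arguments:

  subroutine relax(a, p, q)
    integer :: p, q
    real :: a(10,10)[p,q,*]         ! co-space reshaped using bounds passed as arguments
    ! a(1,1)[2,3,1] refers to the copy on the image at grid position (2,3,1)
  end subroutine relax

  program grid_sketch
    interface
       subroutine relax(a, p, q)
          integer :: p, q
          real :: a(10,10)[p,q,*]
       end subroutine relax
    end interface
    real :: a(10,10)[5,4,*]          ! images managed as a logical 5 x 4 x ... grid
    call relax(a, 5, 4)
  end program grid_sketch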
12
Implementing Communication
x(1:n) = a(1:n)[p] + …
• Use a temporary buffer to hold off-processor data (sketched below)
– allocate buffer
– perform GET to fill buffer
– perform computation:
x(1:n) = buffer(1:n) + …
– deallocate buffer
• Optimizations
– no temporary storage for co-array to co-array copies
– load/store communication on shared-memory systems
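The four buffering steps above can be pictured as a CAF-level sketch; it is illustrative only (y(1:n) stands in for the elided rest of the expression, and in cafc's generated code the co-array read on the right-hand side becomes a runtime GET):

  subroutine add_remote(x, y, a, n, p)
    integer :: n, p
    real :: x(n), y(n)
    real :: a(n)[*]
    real, allocatable :: buffer(:)
    allocate(buffer(n))               ! 1. allocate a temporary for the off-processor data
    buffer(1:n) = a(1:n)[p]           ! 2. GET from image p fills the buffer
    x(1:n) = buffer(1:n) + y(1:n)     ! 3. perform the computation on local data only
    deallocate(buffer)                ! 4. free the temporary
  end subroutine add_remote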
13
Synchronization
• Original CAF specification: team synchronization only
– sync_all, sync_team
• Limits performance on loosely-coupled architectures
• Point-to-point extensions
– sync_notify(q)
– sync_wait(p)
Point-to-point synchronization semantics: delivery of a notify from p to q
guarantees that all communication from p to q issued before the notify
has also been delivered to q
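An illustrative producer/consumer sketch (not from the talk) that uses the point-to-point extensions in place of a barrier:

  subroutine produce_consume(x, y, n, p, q)
    integer :: n, p, q
    real :: x(n)
    real :: y(n)[*]
    if (this_image() == p) then
       y(1:n)[q] = x(1:n)         ! PUT the data into image q's co-array
       call sync_notify(q)        ! notify q; delivery implies the PUT has also been delivered
    else if (this_image() == q) then
       call sync_wait(p)          ! wait for p's notify before touching the data
       x(1:n) = y(1:n)            ! safe: p's communication arrived before its notify
    end if
  end subroutine produce_consume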
14
Outline
• CAF programming model
• cafc
– Core language implementation
– Optimizations
• procedure splitting
• supporting hints for non-blocking communication
• packing strided communications
• Experimental evaluation
• Conclusions
15
An Impediment to Code Efficiency
• Original reference
rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
• Transformed reference
rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
• Fortran 90 pointer-based co-array representation does not convey
– the lack of co-array aliasing
– co-array contiguity
– co-array bounds
• Lack of knowledge inhibits important code optimizations
16
Procedure Splitting
CAF-to-CAF preprocessing

Before:
  subroutine f(…)
    real, save :: c(100)[*]
    ... = c(50) ...
  end subroutine f

After:
  subroutine f(…)
    real, save :: c(100)[*]
    interface
      subroutine f_inner(…, c_arg)
        real :: c_arg(100)[*]
      end subroutine f_inner
    end interface
    call f_inner(…, c)
  end subroutine f

  subroutine f_inner(…, c_arg)
    real :: c_arg(100)[*]
    ... = c_arg(50) ...
  end subroutine f_inner
17
Benefits of Procedure Splitting
• Generated code conveys
– lack of co-array aliasing
– co-array contiguity
– co-array bounds
• Enables back-end compiler to generate better code
18
Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication
– use of indexed subscripts in co-dimensions
– lack of whole program analysis
• Approach: support hints for non-blocking communication
– overcome conservative compiler analysis
– enable sophisticated programmers to achieve good
performance today
19
Hints for Non-blocking PUTs
• Hints for CAF run-time system to issue non-blocking PUTs
region_id = open_nb_put_region()
...
Put_Stmt_1
...
Put_Stmt_N
...
call close_nb_put_region(region_id)
• Complete non-blocking PUTs:
call complete_nb_put_region(region_id)
• Open problem: Exploiting non-blocking GETs?
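Putting the three hint calls together, an illustrative usage sketch; the neighbor exchange and the assumption that the region handle is an integer are mine, not the talk's:

  subroutine exchange(a, b, n, left, right)
    integer :: n, left, right
    integer :: region_id
    integer, external :: open_nb_put_region
    real :: a(n), b(n)[*]
    region_id = open_nb_put_region()         ! PUTs in this region may be issued non-blocking
    b(1:n)[left]  = a(1:n)
    b(1:n)[right] = a(1:n)
    call close_nb_put_region(region_id)
    ! ... independent local computation here can overlap the in-flight PUTs ...
    call complete_nb_put_region(region_id)   ! force completion before signaling the neighbors
    call sync_notify(left)
    call sync_notify(right)
  end subroutine exchange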
20
Strided vs. Contiguous Transfers
• Problem
– a single CAF remote reference might induce many small data transfers
  a(i,1:n)[p] = b(j,1:n)
• Solution
– pack strided data on the source and unpack it on the destination (sketched below)
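Source-level packing can be pictured with an illustrative sketch (not the talk's code; the staging co-array buf and the image roles p and q are assumptions): the sender stages the strided row through a contiguous co-array so the network sees one transfer, and the receiver unpacks locally after point-to-point synchronization.

  subroutine packed_row_put(a, b, buf, m, n, i, j, p, q)
    integer :: m, n, i, j, p, q
    real :: a(m,n)[*]               ! strided destination section lives on image p
    real :: b(m,n)                  ! local source on the sending image q
    real :: buf(n)[*]               ! contiguous staging co-array
    if (this_image() == q) then
       buf(1:n)[p] = b(j,1:n)       ! gather the strided row, then one contiguous PUT to p
       call sync_notify(p)
    else if (this_image() == p) then
       call sync_wait(q)
       a(i,1:n) = buf(1:n)          ! unpack locally into the strided destination
    end if
  end subroutine packed_row_put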
21
Pragmatics of Packing
Who should implement packing?
• The CAF programmer
– difficult to program
• The CAF compiler
– unpacking requires conversion of PUTs into two-sided
communication (a difficult whole-program transformation)
• The communication library
– most natural place
– ARMCI currently performs packing on Myrinet
22
CAF Compiler Targets (Sept 2004)
• Processors
– Pentium, Alpha, Itanium2, MIPS
• Interconnects
– Quadrics, Myrinet, Gigabit Ethernet, shared memory
• Operating systems
– Linux, Tru64, IRIX
23
Outline
• CAF programming model
• cafc
– Core language implementation
– Optimizations
• Experimental evaluation
• Conclusions
24
Experimental Evaluation
• Platforms
– Alpha+Quadrics QSNet (Elan3)
– Itanium2+Quadrics QSNet II (Elan4)
– Itanium2+Myrinet 2000
• Codes
– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
25
NAS BT Efficiency (Class C)
26
NAS SP Efficiency (Class C)
Lack of a non-blocking notify implementation blocks CAF comm/comp overlap
27
NAS MG Efficiency (Class C)
• ARMCI communication is efficient
• point-to-point synchronization boosts CAF performance by 30%
28
NAS CG Efficiency (Class C)
29
NAS LU Efficiency (Class C)
30
Impact of Optimizations
Assorted Results
• Procedure splitting
– 42-60% improvement for BT on Itanium2+Myrinet cluster
– 15-33% improvement for LU on Alpha+Quadrics
• Non-blocking communication generation
– 5% improvement for BT on Itanium2+Quadrics cluster
– 3% improvement for MG on all platforms
• Packing of strided data
– 31% improvement for BT on Alpha+Quadrics cluster
– 37% improvement for LU on Itanium2+Quadrics cluster
See paper for more details
31
Conclusions
• CAF boosts programming productivity
– simplifies the development of SPMD parallel programs
– shifts details of managing communication to the compiler
• cafc delivers performance comparable to hand-tuned MPI
• cafc implements effective optimizations
– procedure splitting
– non-blocking communication
– packing of strided communication (in ARMCI)
• Vectorization needed to achieve true performance portability on
machines like the Cray X1
http://www.hipersoft.rice.edu/caf
32