Transcript Document
Slide 1: A Multi-platform Co-Array Fortran Compiler
Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey
Department of Computer Science, Rice University, Houston, TX, USA

Slide 2: Motivation: Parallel Programming Models
• MPI: the de facto standard – difficult to program
• OpenMP: inefficient to map onto distributed-memory platforms – lacks locality control
• HPF: hard to obtain high performance – heroic compilers needed!
Global address space languages (CAF, Titanium, UPC) are an appealing middle ground.

Slide 3: Co-Array Fortran
• Global address space programming model – one-sided communication (GET/PUT)
• Programmer has control over performance-critical factors – data distribution – computation partitioning – communication placement
• Data movement and synchronization as language primitives – amenable to compiler-based communication optimization

Slide 4: CAF Programming Model Features
• SPMD process images – fixed number of images during execution – images operate asynchronously
• Both private and shared data
  – real x(20,20): a private 20x20 array in each image
  – real y(20,20)[*]: a shared 20x20 array in each image
• Simple one-sided shared-memory communication
  – x(:,j:j+2) = y(:,p:p+2)[r] copies columns from image r into local columns
• Synchronization intrinsic functions
  – sync_all: a barrier and a memory fence
  – sync_mem: a memory fence
  – sync_team([team members to notify], [team members to wait for])
• Pointers and (possibly asymmetric) dynamic allocation

Slide 5: One-sided Communication with Co-Arrays
  integer a(10,20)[*]
[figure: each of images 1, 2, …, N holds its own copy of a(10,20)]
  if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[figure: arrows show each image copying the last two columns from its left neighbor]

Slide 6: Rice Co-Array Fortran Compiler (cafc)
• First multi-platform CAF compiler – the previous compiler targeted only Cray shared-memory systems
• Implements the core of the language – currently lacks support for derived-type and dynamic co-arrays
• The core is sufficient for non-trivial codes
• Performance comparable to that of hand-tuned MPI codes
• Open source

Slide 7: Outline
• CAF programming model
• cafc – core language implementation – optimizations
• Experimental evaluation
• Conclusions

Slide 8: Implementation Strategy
• Source-to-source compilation of CAF codes
  – uses the Open64/SL Fortran 90 infrastructure
  – translates CAF into Fortran 90 plus communication operations
• Communication
  – ARMCI library for one-sided communication on clusters
  – load/store communication on shared-memory platforms
• Goals: portability; high performance on a wide range of platforms

Slide 9: Co-Array Descriptors
A co-array such as
  real :: a(10,10,10)[*]
is represented by a descriptor:
  type CAFDesc_real_3
    integer(ptrkind) :: handle     ! opaque handle to the CAF runtime representation
    real, pointer    :: ptr(:,:,:) ! Fortran 90 pointer to local co-array data
  end type CAFDesc_real_3
  type(CAFDesc_real_3) :: a
• Initialize and manipulate Fortran 90 dope vectors

Slide 10: Allocating COMMON and SAVE Co-Arrays
• Compiler – generates a static initializer for each COMMON/SAVE co-array variable
• Linker – collects calls to all initializers – generates a global initializer that calls all others – compiles the global initializer and links it into the program
• Launch – invokes the global initializer before the main program begins; it allocates co-array storage outside the Fortran 90 runtime system and associates co-array descriptors with the allocated memory
Similar to the handling of C++ static constructors.

Slide 11: Parameter Passing
• Call-by-value convention (copy-in, copy-out)
    call f((a(I)[p]))
  – passes remote co-array data to procedures only as values
• Call-by-co-array convention*
  – the argument is declared as a co-array by the callee:
      subroutine f(a)
        real :: a(10)[*]
  – enables access to both local and remote co-array data
• Call-by-reference convention* (used by cafc)
  – the argument is declared as an explicit-shape array:
      real :: x(10)[*]
      call f(x)
      subroutine f(a)
        real :: a(10)
  – enables access to local co-array data only
  – enables reuse of existing Fortran code
* requires an explicit interface

Slide 12: Multiple Co-dimensions
Managing processors as a logical multi-dimensional grid:
  integer a(10,10)[5,4,*]   ! 3D processor grid 5 x 4 x …
• Support co-space reshaping at procedure calls – change the number of co-dimensions – pass co-space bounds as procedure arguments

Slide 13: Implementing Communication
For x(1:n) = a(1:n)[p] + …
• Use a temporary buffer to hold off-processor data
  – allocate the buffer
  – perform a GET to fill the buffer
  – perform the computation: x(1:n) = buffer(1:n) + …
  – deallocate the buffer
• Optimizations
  – no temporary storage for co-array-to-co-array copies
  – load/store communication on shared-memory systems

Slide 14: Synchronization
• The original CAF specification offers team synchronization only – sync_all, sync_team
• This limits performance on loosely coupled architectures
• Point-to-point extensions – sync_notify(q) – sync_wait(p)
• Point-to-point synchronization semantics: upon delivery of a notify to q from p, all communication from p to q issued before the notify has also been delivered to q.

Slide 15: Outline
• CAF programming model
• cafc – core language implementation – optimizations: procedure splitting, supporting hints for non-blocking communication, packing strided communications
• Experimental evaluation
• Conclusions

Slide 16: An Impediment to Code Efficiency
• Original reference:
    rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
• Transformed reference:
    rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
• The Fortran 90 pointer-based co-array representation does not convey – the lack of co-array aliasing – co-array contiguity – co-array bounds
• This lack of knowledge inhibits important code optimizations

Slide 17: Procedure Splitting (CAF-to-CAF preprocessing)
Before:
  subroutine f(…)
    real, save :: c(100)[*]
    ... = c(50) ...
  end subroutine f
After:
  subroutine f(…)
    real, save :: c(100)[*]
    interface
      subroutine f_inner(…, c_arg)
        real :: c_arg(100)[*]
      end subroutine f_inner
    end interface
    call f_inner(…, c)
  end subroutine f
  subroutine f_inner(…, c_arg)
    real :: c_arg(100)[*]
    ... = c_arg(50) ...
  end subroutine f_inner

Slide 18: Benefits of Procedure Splitting
• The generated code conveys – the lack of co-array aliasing – co-array contiguity – co-array bounds
• Enables the back-end compiler to generate better code

Slide 19: Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication – use of indexed subscripts in co-dimensions – lack of whole-program analysis
• Approach: support hints for non-blocking communication – overcome conservative compiler analysis – enable sophisticated programmers to achieve good performance today

Slide 20: Hints for Non-blocking PUTs
• Hints for the CAF run-time system to issue non-blocking PUTs:
    region_id = open_nb_put_region()
    ...
    Put_Stmt_1
    ...
    Put_Stmt_N
    ...
    call close_nb_put_region(region_id)
• Completing the non-blocking PUTs:
    call complete_nb_put_region(region_id)
• Open problem: exploiting non-blocking GETs?

Slide 21: Strided vs. Contiguous Transfers
• Problem: a CAF remote reference might induce many small data transfers
    a(i,1:n)[p] = b(j,1:n)
• Solution: pack strided data on the source and unpack it on the destination

Slide 22: Pragmatics of Packing
Who should implement packing?
• The CAF programmer – difficult to program
• The CAF compiler – unpacking requires converting PUTs into two-sided communication (a difficult whole-program transformation)
• The communication library – the most natural place – ARMCI currently performs packing on Myrinet

Slide 23: CAF Compiler Targets (September 2004)
• Processors – Pentium, Alpha, Itanium2, MIPS
• Interconnects – Quadrics, Myrinet, Gigabit Ethernet, shared memory
• Operating systems – Linux, Tru64, IRIX

Slide 24: Outline
• CAF programming model
• cafc – core language implementation – optimizations
• Experimental evaluation
• Conclusions

Slide 25: Experimental Evaluation
• Platforms
  – Alpha + Quadrics QSNet (Elan3)
  – Itanium2 + Quadrics QSNet II (Elan4)
  – Itanium2 + Myrinet 2000
• Codes – NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

Slide 26: NAS BT Efficiency (Class C)

Slide 27: NAS SP Efficiency (Class C)
• The lack of a non-blocking notify implementation prevents CAF communication/computation overlap

Slide 28: NAS MG Efficiency (Class C)
• ARMCI communication is efficient
• Point-to-point synchronization boosts CAF performance by 30%

Slide 29: NAS CG Efficiency (Class C)

Slide 30: NAS LU Efficiency (Class C)

Slide 31: Impact of Optimizations (Assorted Results)
• Procedure splitting
  – 42-60% improvement for BT on the Itanium2+Myrinet cluster
  – 15-33% improvement for LU on Alpha+Quadrics
• Non-blocking communication generation
  – 5% improvement for BT on the Itanium2+Quadrics cluster
  – 3% improvement for MG on all platforms
• Packing of strided data
  – 31% improvement for BT on the Alpha+Quadrics cluster
  – 37% improvement for LU on the Itanium2+Quadrics cluster
See the paper for more details.

Slide 32: Conclusions
• CAF boosts programming productivity – it simplifies the development of SPMD parallel programs – it shifts the details of managing communication to the compiler
• cafc delivers performance comparable to hand-tuned MPI
• cafc implements effective optimizations – procedure splitting – non-blocking communication – packing of strided communication (in ARMCI)
• Vectorization is needed to achieve true performance portability on machines like the Cray X1
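To make the point-to-point synchronization semantics described in the talk concrete, here is a minimal hedged sketch of a producer/consumer exchange using the sync_notify/sync_wait extension. The co-array buf and the routines compute_column and consume are illustrative assumptions, not part of the talk:

```fortran
! Sketch only: image 1 sends data to image 2, then notifies it.
! compute_column() and consume() are hypothetical placeholders.
real :: buf(100)[*]

if (this_image() == 1) then
  buf(:)[2] = compute_column()   ! one-sided PUT into image 2's buf
  call sync_notify(2)            ! per the extension's semantics, the notify is
                                 ! delivered only after the PUT has arrived
else if (this_image() == 2) then
  call sync_wait(1)              ! block until image 1's notify is delivered
  call consume(buf)              ! safe: the data PUT by image 1 is now visible
end if
```

A pattern of this shape, replacing whole-team barriers with pairwise notify/wait, is the kind of change behind the point-to-point synchronization gains reported for NAS MG.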
http://www.hipersoft.rice.edu/caf
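As a closing illustration of the temporary-buffer translation scheme described in the talk for remote reads, here is a hedged sketch of what cafc's source-to-source output for a statement like x(1:n) = a(1:n)[p] + y(1:n) might resemble. CAF_RUNTIME_GET and the expression's y(1:n) term are hypothetical names chosen for this sketch, not the actual cafc runtime interface:

```fortran
! Sketch only: possible translation of  x(1:n) = a(1:n)[p] + y(1:n)
! CAF_RUNTIME_GET is a hypothetical runtime entry point.
real, allocatable :: tmp(:)
allocate(tmp(n))                        ! temporary buffer for off-processor data
call CAF_RUNTIME_GET(a%handle, p, tmp)  ! GET a(1:n) from image p into tmp
x(1:n) = tmp(1:n) + y(1:n)              ! compute using the local copy
deallocate(tmp)                         ! buffer is freed immediately after use
```

On shared-memory platforms, or for co-array-to-co-array copies, the talk notes that the buffer and the GET call disappear entirely in favor of direct load/store access.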