An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C

Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey (Rice University)
Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao (George Washington University)
Daniel Chavarria-Miranda (Pacific Northwest National Laboratory)

GAS Languages

• Global address space programming model
  – one-sided communication (GET/PUT), simpler than message passing
• Programmer has control over performance-critical factors
  – data distribution and locality control
  – computation partitioning
  – communication placement
  (this control is lacking in OpenMP; HPF & OpenMP compilers must get it right)
• Data movement and synchronization as language primitives
  – amenable to compiler-based communication optimization

Questions

• Can GAS languages match the performance of hand-tuned message-passing programs?
• What are the obstacles to obtaining performance with GAS languages?
• What should be done to ameliorate them?
  – by language modifications or extensions
  – by compilers
  – by run-time systems
• How easy is it to develop high-performance programs in GAS languages?

Approach

Evaluate CAF and UPC using the NAS Parallel Benchmarks:
• Compare performance to that of the MPI versions
  – use hardware performance counters to pinpoint differences
• Determine optimization techniques common to both languages, as well as language-specific ones
  – language features
  – program implementation strategies
  – compiler optimizations
  – runtime optimizations
• Assess the programmability of the CAF and UPC variants

Outline

• Questions and approach
• CAF & UPC
  – Features
  – Compilers
  – Performance considerations
• Experimental evaluation
• Conclusions

CAF & UPC Common Features

• SPMD programming model
• Both private and shared data
• Language-level one-sided shared-memory communication
• Synchronization intrinsic functions (barrier, fence)
• Pointers and dynamic allocation

CAF & UPC Differences I

• Multidimensional arrays
  – CAF: multidimensional arrays, procedure argument reshaping
  – UPC: linearization, typically using macros
• Local accesses to shared data
  – CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)
  – UPC: shared array reference using MYTHREAD, or a C pointer (see the sketch below)
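To make the local-access contrast concrete, here is a minimal sketch; the sizes N and M, the macro IDX, and the routine name are illustrative, not from the talk:

CAF:
    integer :: a(N,M)[*]      ! one N x M co-array block per image
    a(:,j) = 0                ! local access with ordinary Fortran 90 syntax

UPC:
    #include <upc_relaxed.h>
    #define N 64
    #define M 64
    #define IDX(i,j) ((j)*N + (i))          /* hand-written linearization macro */

    shared [N*M] int a[THREADS*N*M];        /* one contiguous N*M block per thread */

    void zero_local_column(int j) {
        int *la = (int *)&a[MYTHREAD*N*M];  /* private pointer to the local block */
        for (int i = 0; i < N; i++)
            la[IDX(i,j)] = 0;               /* local access through a plain C pointer */
    }

Casting the shared reference to a private C pointer, as the slide suggests, avoids the overhead of UPC shared-pointer arithmetic on every local access.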
CAF & UPC Differences II

• Scalar/element-wise remote accesses
  – CAF: multidimensional subscripts + bracket syntax
      a(1,1) = a(1,M)[this_image()-1]
  – UPC: shared ("flat") array access with linearized subscripts
      a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]
• Bulk and strided remote accesses
  – CAF: the natural syntax of Fortran 90 array sections, with operations on remote co-array sections (fewer temporaries on SMPs)
  – UPC: library functions (and temporary storage to hold a copy)

Bulk Communication

CAF:
    integer a(N,M)[*]
    a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]

UPC:
    shared int *a;
    upc_memget((int *)&a[N*M*MYTHREAD],   /* destination: private pointer into the local part */
               &a[N*M*MYTHREAD-2*N],      /* source: shared data on the neighboring thread */
               2*N*sizeof(int));

(Figure: the array a as an N x M block on each of the images/threads P1 ... PN; each one fetches the last two columns of its left neighbor's block.)

CAF & UPC Differences III

• Synchronization
  – CAF: team synchronization
  – UPC: split-phase barrier, locks
• UPC: worksharing construct upc_forall
• UPC: richer set of pointer types

Outline

• Questions and approach
• CAF & UPC
  – Features
  – Compilers
  – Performance considerations
• Experimental evaluation
• Conclusions

CAF Compilers

• Rice Co-Array Fortran Compiler (cafc)
  – Multi-platform compiler
  – Implements the core of the language
    • core sufficient for non-trivial codes
    • currently lacks support for derived-type and dynamic co-arrays
  – Source-to-source translator
    • translates CAF into Fortran 90 plus communication code
    • uses ARMCI or GASNet as the communication substrate
    • can generate loads/stores for remote data accesses on SMPs
  – Performance comparable to that of hand-tuned MPI codes
  – Open source
• Vendor compilers: Cray

UPC Compilers

• Berkeley UPC Compiler
  – Multi-platform compiler
  – Implements the full UPC 1.1 specification
  – Source-to-source translator
    • converts UPC into ANSI C plus calls to the UPC runtime library and GASNet
    • tailors code to a specific architecture: cluster or SMP
  – Open source
• Intrepid UPC compiler
  – Based on the GCC compiler
  – Works on the SGI Origin, Cray T3E, and Linux SMPs
• Other vendor compilers: Cray, HP

Outline

• Motivation and Goals
• CAF & UPC
  – Features
  – Compilers
  – Performance considerations
• Experimental evaluation
• Conclusions

Scalar Performance

• Generate code amenable to back-end compiler optimizations
  – quality of the back-end compilers matters
    • poor reduction recognition in the Intel C compiler
• Local access to shared data
  – CAF: use F90 pointers and procedure arguments
  – UPC: use C pointers instead of UPC shared pointers
• Alias and dependence analysis
  – Fortran vs. C language semantics
    • multidimensional arrays in Fortran
    • procedure argument reshaping
  – Convey the lack of aliasing for (non-aliased) shared variables
    • CAF: use procedure splitting so co-arrays are referenced as procedure arguments
    • UPC: use the C99 restrict keyword for C pointers used to access shared data (see the sketch below)
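A minimal sketch of the UPC restrict idiom described above; the arrays u and v, the block size N, and the routine name local_axpy are ours:

    #include <upc_relaxed.h>
    #define N 1024

    shared [N] double u[THREADS*N];   /* one block of N elements per thread */
    shared [N] double v[THREADS*N];

    /* Update this thread's block. The restrict qualifiers tell the
       back-end C compiler that lu and lv do not alias, recovering the
       Fortran-like optimization the slide refers to. */
    void local_axpy(double alpha) {
        double * restrict lu = (double *)&u[MYTHREAD*N];
        const double * restrict lv = (const double *)&v[MYTHREAD*N];
        for (int i = 0; i < N; i++)
            lu[i] += alpha * lv[i];
    }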
Communication

• Communication vectorization is essential for high performance on cluster architectures, in both languages
  – CAF: use F90 array sections (the compiler translates them into the appropriate library calls)
  – UPC: use library functions for contiguous transfers; use the Berkeley UPC compiler's extensions for strided transfers
• Increase the efficiency of strided transfers by packing/unpacking data at the language level

Synchronization

• Barrier-based synchronization
  – can lead to over-synchronized code
• Use point-to-point synchronization instead
  – CAF: proposed language extension (sync_notify, sync_wait)
  – UPC: language-level implementation (see the sketch below)
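One plausible language-level implementation of point-to-point notify/wait in UPC; the names notify and wait_for and the flag layout are ours, not necessarily the code used in the study:

    #include <upc_relaxed.h>

    /* Slot [s*THREADS + d] has affinity to thread d under the default
       cyclic layout, so the waiter spins on local memory. Assumes the
       static-THREADS compilation environment, since THREADS appears
       twice in the array dimension. */
    strict shared int notify_count[THREADS*THREADS];

    int sent[THREADS];   /* private: notifies sent to each destination */
    int seen[THREADS];   /* private: notifies consumed from each source */

    void notify(int dest) {
        sent[dest] += 1;
        /* A strict put: it completes only after this thread's earlier
           relaxed puts, so the data being signaled is already visible. */
        notify_count[MYTHREAD*THREADS + dest] = sent[dest];
    }

    void wait_for(int src) {
        seen[src] += 1;
        while (notify_count[src*THREADS + MYTHREAD] < seen[src])
            ;  /* spin on a locally resident flag */
    }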
Outline

• Questions and approach
• CAF & UPC
• Experimental evaluation
• Conclusions

Platforms and Benchmarks

• Platforms
  – Itanium2 + Myrinet 2000 (900 MHz Itanium2)
  – Alpha + Quadrics QSNetI (1 GHz Alpha EV68CB)
  – SGI Altix 3000 (1.5 GHz Itanium2)
  – SGI Origin 2000 (R10000)
• Codes
  – NAS Parallel Benchmarks (NPB 2.3) from NASA Ames: MG, CG, SP, BT
  – CAF and UPC versions were derived from the Fortran 77 + MPI versions

MG class A (256³) on Itanium2+Myrinet2000

(Performance chart; higher is better.)
• Intel compiler: restrict yields a 2.3x performance improvement
• CAF: point-to-point synchronization 35% faster than barriers
• UPC: strided communication 28% faster than multiple transfers; point-to-point 49% faster than barriers

MG class C (512³) on SGI Altix 3000

(Performance chart; higher is better.)
• Intel C compiler: scalar performance problems
• Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts

MG class B (256³) on SGI Origin 2000

(Performance chart; higher is better.)

CG class C (150000) on SGI Altix 3000

(Performance chart; higher is better.)
• Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!
• point-to-point synchronization 19% faster than barriers

CG class B (75000) on SGI Origin 2000

(Performance chart; higher is better.)
• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with SGI C/Fortran!

SP class C (162³) on Itanium2+Myrinet2000

(Performance chart; higher is better.)
• restrict yields an 18% performance improvement

SP class C (162³) on Alpha+Quadrics

(Performance chart; higher is better.)

BT class C (162³) on Itanium2+Myrinet2000

(Performance chart; higher is better.)
• CAF: communication packing is 7% faster
• CAF: procedure splitting improves performance by 42-60%
• UPC: communication packing is 32% faster
• UPC: use of restrict boosts performance by 43%

BT class B (102³) on SGI Altix 3000

(Performance chart; higher is better.)
• use of restrict improves performance by 30%

Conclusions

• Matching MPI performance required using bulk communication
  – library-based primitives are cumbersome in UPC
  – communicating multidimensional array sections is natural in CAF
  – the lack of efficient run-time support for strided communication is a problem
• With CAF, we can achieve performance comparable to MPI
• With UPC, matching MPI performance can be difficult
  – CG: able to match MPI on all platforms
  – SP, BT, MG: a substantial gap remains

Why the Gap?

• The communication layer is not the problem
  – CAF with ARMCI or GASNet yields equivalent performance
• Scalar code optimization of scientific code is the key!
  – SP + BT: SGI Fortran: unroll-and-jam, software pipelining
  – MG: SGI Fortran: loop alignment, fusion
  – CG: Intel Fortran: optimized sum reduction
• Linearized subscripts for multidimensional arrays hurt!
  – measured a 30% performance gap with Intel Fortran

Programming for Performance

• In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult
• To make codes efficient across the full range of architectures, we need
  – better language support for synchronization
    • point-to-point synchronization is an important common case!
  – better CAF & UPC compiler support
    • communication vectorization
    • synchronization strength reduction
  – better compiler optimization of loops with complex dependence patterns
  – better run-time library support
    • efficient communication of strided array sections