
An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
Rice University
Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao
George Washington University
Daniel Chavarria-Miranda
Pacific Northwest National Laboratory
GAS Languages
• Global address space programming model
– one-sided communication (GET/PUT), simpler than message passing
• Programmer has control over performance-critical factors
– data distribution and locality control
– computation partitioning
– communication placement
(control over these factors is lacking in OpenMP; HPF & OpenMP compilers must get them right)
• Data movement and synchronization as language primitives
– amenable to compiler-based communication optimization
Questions
• Can GAS languages match the performance of hand-tuned
message passing programs?
• What are the obstacles to obtaining performance with GAS
languages?
• What should be done to ameliorate them?
– by language modifications or extensions
– by compilers
– by run-time systems
• How easy is it to develop high performance programs in GAS
languages?
Approach
Evaluate CAF and UPC using NAS Parallel Benchmarks
• Compare performance to that of MPI versions
– use hardware performance counters to pinpoint differences
• Determine optimization techniques common to both languages, as well as language-specific optimizations
– language features
– program implementation strategies
– compiler optimizations
– runtime optimizations
• Assess programmability of the CAF and UPC variants
Outline
• Questions and approach
• CAF & UPC
– Features
– Compilers
– Performance considerations
• Experimental evaluation
• Conclusions
CAF & UPC Common Features
• SPMD programming model
• Both private and shared data
• Language-level one-sided shared-memory communication
• Synchronization intrinsic functions (barrier, fence)
• Pointers and dynamic allocation
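A minimal UPC sketch of these common features; the counts array, its size, and the reduction on thread 0 are illustrative assumptions, not code from the study (CAF has direct analogues of each feature):

#include <upc.h>
#include <stdio.h>

#define N 4

shared int counts[N * THREADS];   /* shared data, default cyclic layout */

int main(void) {
    int i, sum = 0;               /* private data: one copy per thread */

    /* SPMD: every thread runs main; each writes the elements it owns */
    for (i = MYTHREAD; i < N * THREADS; i += THREADS)
        counts[i] = MYTHREAD;

    upc_barrier;                  /* barrier intrinsic: all writes complete */

    /* one-sided reads: thread 0 fetches every element, local or remote */
    if (MYTHREAD == 0) {
        for (i = 0; i < N * THREADS; i++)
            sum += counts[i];
        printf("sum = %d\n", sum);
    }
    return 0;
}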
CAF & UPC Differences I
• Multidimensional arrays
– CAF: multidimensional arrays, procedure argument reshaping
– UPC: linearization, typically using macros
• Local accesses to shared data
– CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N)
– UPC: shared array reference using MYTHREAD or a C pointer (see the sketch below)
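A hedged sketch of the two UPC idioms above; the array a, its block size, and the scale routines are assumptions for illustration:

#include <upc.h>

#define N 100

shared [N] double a[N * THREADS];   /* one contiguous block of N per thread */

void scale_shared(double s) {
    /* idiom 1: index the shared array with MYTHREAD; every access is
       local, but it still pays shared-pointer overhead */
    for (int i = 0; i < N; i++)
        a[N * MYTHREAD + i] *= s;
}

void scale_private(double s) {
    /* idiom 2: cast the local block to a plain C pointer, legal because
       this thread has affinity to &a[N * MYTHREAD] */
    double *p = (double *)&a[N * MYTHREAD];
    for (int i = 0; i < N; i++)
        p[i] *= s;
}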
CAF & UPC Differences II
• Scalar/element-wise remote accesses
– CAF: multidimensional subscripts + bracket syntax
a(1,1) = a(1,M)[this_image()-1]
– UPC: shared (“flat”) array access with linearized subscripts
a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N]
• Bulk and strided remote accesses
– CAF: use natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs)
– UPC: use library functions (and temporary storage to hold a copy)
Bulk Communication
CAF:
integer a(N,M)[*]
a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
UPC:
shared [N*M] int a[N*M*THREADS];   /* block-distributed: one N*M block per thread */
/* shared-to-shared bulk copy of the left neighbor's last two columns */
upc_memcpy(&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
[Figure: N×M array blocks on processors P1, P2, …, PN, illustrating the two-column transfer]
CAF & UPC Differences III
• Synchronization
– CAF: team synchronization
– UPC: split-phase barrier, locks
• UPC: worksharing construct upc_forall (see the sketch after this list)
• UPC: richer set of pointer types
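A brief sketch of upc_forall; the vectors x and y and their cyclic layout are illustrative assumptions:

#include <upc.h>

#define N 1000

shared double x[N * THREADS], y[N * THREADS];  /* cyclic: element i on thread i % THREADS */

void axpy(double alpha) {
    int i;
    /* the affinity clause &x[i] assigns iteration i to the thread that
       owns x[i], so every access in the body is local */
    upc_forall(i = 0; i < N * THREADS; i++; &x[i])
        y[i] += alpha * x[i];
}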
Outline
• Questions and approach
• CAF & UPC
– Features
– Compilers
– Performance considerations
• Experimental evaluation
• Conclusions
CAF Compilers
• Rice Co-Array Fortran Compiler (cafc)
– Multi-platform compiler
– Implements core of the language
• core sufficient for non-trivial codes
• currently lacks support for derived-type and dynamic co-arrays
– Source-to-source translator
• translates CAF into Fortran 90 and communication code
• uses ARMCI or GASNet as communication substrate
• can generate load/store for remote data accesses on SMPs
– Performance comparable to that of hand-tuned MPI codes
– Open source
• Vendor compilers: Cray
UPC Compilers
• Berkeley UPC Compiler
– Multi-platform compiler
– Implements full UPC 1.1 specification
– Source-to-source translator
• converts UPC into ANSI C and calls to UPC runtime library & GASNet
• tailors code to a specific architecture: cluster or SMP
– Open source
• Intrepid UPC compiler
– Based on GCC compiler
– Works on SGI Origin, Cray T3E and Linux SMP
• Other vendor compilers: Cray, HP
Outline
• Questions and approach
• CAF & UPC
– Features
– Compilers
– Performance considerations
• Experimental evaluation
• Conclusions
Scalar Performance
• Generate code amenable to back-end compiler optimizations
– Quality of back-end compilers matters
• e.g., poor reduction recognition in the Intel C compiler
• Local access to shared data
– CAF: use F90 pointers and procedure arguments
– UPC: use C pointers instead of UPC shared pointers
• Alias and dependence analysis
– Fortran vs. C language semantics
• multidimensional arrays in Fortran
• procedure argument reshaping
– Convey lack of aliasing for (non-aliased) shared variables
• CAF: use procedure splitting so co-arrays are referenced as procedure arguments
• UPC: use the C99 restrict keyword for C pointers used to access shared data (see the sketch below)
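A hedged sketch of the UPC restrict idiom from the last bullet; the kernel relax, the arrays u and v, and the block size are illustrative, not taken from the benchmarks:

#include <upc.h>

#define N 512

shared [N] double u[N * THREADS], v[N * THREADS];

/* the kernel sees only private pointers; restrict promises the back-end
   C compiler that u_ and v_ do not alias, enabling loop optimizations */
static void relax(double *restrict u_, const double *restrict v_, int n) {
    for (int i = 1; i < n - 1; i++)
        u_[i] = 0.5 * (v_[i - 1] + v_[i + 1]);
}

void step(void) {
    /* cast each thread's local blocks to plain C pointers */
    relax((double *)&u[N * MYTHREAD], (double *)&v[N * MYTHREAD], N);
}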
Communication
• Communication vectorization is essential for high performance
on cluster architectures for both languages
– CAF
• use F90 array sections (compiler translates to appropriate library calls)
– UPC
• use library functions for contiguous transfers
• use UPC extensions for strided transfers in the Berkeley UPC compiler
• Increase efficiency of strided transfers by packing/unpacking data at the language level (see the sketch below)
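One way to express language-level packing in UPC, as a sketch; the staging array, the sizes N and M, and the column-exchange pattern are assumptions for illustration:

#include <upc.h>

#define N 64    /* rows */
#define M 64    /* columns; the private array is row-major */

double a[N][M];                        /* private array on every thread */
shared [N] double stage[N * THREADS];  /* contiguous staging block per thread */

/* pack column j (stride M in memory) into a contiguous buffer, then
   ship it to thread dest with one bulk put instead of N small ones */
void send_column(int j, int dest) {
    double buf[N];
    for (int i = 0; i < N; i++)
        buf[i] = a[i][j];
    upc_memput(&stage[N * dest], buf, N * sizeof(double));
}

/* after synchronizing with the sender, unpack locally into column j */
void recv_column(int j) {
    double *p = (double *)&stage[N * MYTHREAD];  /* local staging block */
    for (int i = 0; i < N; i++)
        a[i][j] = p[i];
}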
Synchronization
• Barrier-based synchronization
– Can lead to over-synchronized code
• Use point-to-point synchronization
– CAF: proposed language extension (sync_notify, sync_wait)
– UPC: language-level implementation (see the sketch below)
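A hedged sketch of how notify/wait can be built at the UPC language level, mirroring the proposed CAF sync_notify/sync_wait; the flag array, the routine names, and the one-sender-per-receiver assumption are illustrative:

#include <upc.h>

/* one flag per receiving thread; assumes a single sender per receiver
   at a time (e.g. nearest-neighbor sweeps); strict accesses provide
   the ordering guarantees that synchronization requires */
strict shared int arrived[THREADS];

/* sender: make earlier relaxed puts visible, then signal dest */
void my_sync_notify(int dest) {
    upc_fence;
    arrived[dest] += 1;     /* arrived[dest] has affinity to thread dest */
}

/* receiver: spin until the expected number of signals has arrived */
void my_sync_wait(int expected) {
    while (arrived[MYTHREAD] < expected)
        ;                   /* strict reads are re-fetched each iteration */
}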
Outline
• Questions and approach
• CAF & UPC
• Experimental evaluation
• Conclusions
Platforms and Benchmarks
• Platforms
– Itanium2+Myrinet 2000 (900 MHz Itanium2)
– Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)
– SGI Altix 3000 (1.5 GHz Itanium2)
– SGI Origin 2000 (R10000)
• Codes
– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
– MG, CG, SP, BT
– CAF and UPC versions were derived from Fortran77+MPI versions
MG class A (256³) on Itanium2+Myrinet2000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
• Intel compiler: restrict yields a 2.3x performance improvement
• CAF: point-to-point synchronization 35% faster than barriers
• UPC: strided communication 28% faster than multiple point-to-point transfers
• UPC: point-to-point synchronization 49% faster than barriers
MG class C (512³) on SGI Altix 3000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
• Intel C compiler: scalar performance lags the Fortran compiler's
• Intel Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts
MG class B (256³) on SGI Origin 2000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
CG class C (150000) on SGI Altix 3000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
• Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!
• point-to-point synchronization 19% faster than barriers
CG class B (75000) on SGI Origin 2000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with SGI C/Fortran!
SP class C (162³) on Itanium2+Myrinet2000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
• restrict yields an 18% performance improvement
SP class C (162³) on Alpha+Quadrics
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
BT class C (162³) on Itanium2+Myrinet2000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
• CAF: communication packing 7% faster
• CAF: procedure splitting improves performance 42-60%
• UPC: communication packing 32% faster
• UPC: use of restrict boosts performance by 43%
BT class B (102³) on SGI Altix 3000
[Chart: performance of the MPI, CAF, and UPC versions; higher is better]
• use of restrict improves performance 30%
Conclusions
• Matching MPI performance required using bulk communication
– library-based primitives are cumbersome in UPC
– communicating multi-dimensional array sections is natural in CAF
– lack of efficient run-time support for strided communication is a
problem
• With CAF, can achieve performance comparable to MPI
• With UPC, matching MPI performance can be difficult
– CG: able to match MPI on all platforms
– SP, BT, MG: substantial gap remains
Why the Gap?
• Communication layer is not the problem
– CAF with ARMCI or GASNet yields equivalent performance
• Scalar optimization of scientific code is the key!
– SP+BT: SGI Fortran: unroll-and-jam, software pipelining (SWP)
– MG: SGI Fortran: loop alignment, fusion
– CG: Intel Fortran: optimized sum reduction
• Linearized subscripts for multidimensional arrays hurt!
– measured 30% performance gap with Intel Fortran
Programming for Performance
• In the absence of effective optimizing compilers for CAF and
UPC, achieving high performance is difficult
• To make codes efficient across the full range of
architectures, we need
– better language support for synchronization
• point-to-point synchronization is an important common case!
– better CAF & UPC compiler support
• communication vectorization
• synchronization strength reduction
– better compiler optimization of loops with complex dependence
patterns
– better run-time library support
• efficient communication of strided array sections