
A Multi-platform Co-array Fortran Compiler
for High-performance Computing
www.hipersoft.rice.edu/caf
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey, Daniel Chavarría-Miranda
{ccristi, dotsenko, johnmc, danich}@cs.rice.edu
NSF
Co-Array Fortran Language
• SPMD programming model
– fixed number of images during execution
– images operate asynchronously
• Both private and shared data
– real a(20,20)        private: a 20x20 array in each image
– real a(20,20) [*]    shared: a 20x20 array in each image
• Simple one-sided communication (PUT & GET)
– x(:,j:j+2) = a(r,:) [p:p+2]    copy rows from images p:p+2 into local columns
• Flexible explicit synchronization
– sync_team(team [,wait])
  • team = a vector of process ids to synchronize with
  • wait = a vector of processes to wait for
• Pointers and dynamic allocation
[Figure: integer :: a(10,20)[*] gives each of image 1, image 2, ..., image N its own a(10,20). The statement
if (this_image() > 1) a(1:10,1:2)=a(1:10,19:20)[this_image()-1]
copies from the left neighbor.]
CAF Model Refinements
• Point-to-point synchronization
– sync_notify(p)
– sync_wait(p)
• Less restrictive memory fences at call site
• Collective operations
Rice CAF Compiler
• Source-to-source code generation
• Open source compiler
• Built on Open64/SL infrastructure
• Support for core language features
• Code generation:
– library-based communication: widely-portable ARMCI communication library and array descriptor CHASM library
– load/store communication: on shared-memory platforms
• Operating systems:
– Linux IA64/IA32
– Alpha Tru64
– SGI IRIX64
• Interconnects & Platforms:
– Quadrics QSNet (Elan 3), QSNet II (Elan 4)
– Myrinet 2000
– Ethernet
– SGI Altix 3000, SGI Origin 2000
Current Optimizations
• Procedure Splitting
• Hints for non-blocking communication
• Library-based and load/store communication
• Packing strided communication
Planned Optimizations
• Communication vectorization
• Synchronization strength-reduction
• Automatic split-phase communication
• Platform-driven communication optimizations
– transform communication from one-sided into two-sided and collective, if useful
– multi-model code for hierarchical architectures
– convert GETs into PUTs
• Multi-buffer co-arrays for asynchrony tolerance
• Employ virtualization for latency tolerance
• Interoperability with other programming models
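The "convert GETs into PUTs" item under Planned Optimizations can be illustrated with the poster's own left-neighbor copy. The sketch below is only an illustration, not the compiler's actual transformation. It is written in standard Fortran 2008 coarray syntax, so it uses `sync all` rather than the sync_notify/sync_wait primitives of the Rice model. The second co-array `b`, the program name, and the consistency checks are hypothetical additions made for this example.

```fortran
program halo_get_vs_put
  implicit none
  integer :: a(10,20)[*]   ! same declaration as on the poster
  integer :: b(10,20)[*]   ! hypothetical second co-array for the PUT variant
  integer :: me, n

  me = this_image()
  n  = num_images()

  ! Fill both arrays with the owning image's index.
  a = me
  b = me
  sync all                 ! every image has initialized its data

  ! GET form (as on the poster): pull two columns from the left neighbor.
  if (me > 1) a(1:10,1:2) = a(1:10,19:20)[me-1]

  ! PUT form: push the same two columns to the right neighbor.
  if (me < n) b(1:10,1:2)[me+1] = b(1:10,19:20)

  sync all                 ! all remote reads and writes have completed

  ! Both forms should leave identical data on every image.
  if (any(a /= b)) error stop "GET and PUT variants disagree"
  if (me > 1) then
    if (any(a(:,1:2) /= me-1)) error stop "unexpected halo contents"
  end if

  if (me == 1) print *, "GET and PUT halo exchanges agree on", n, "images"
end program halo_get_vs_put
```

In the GET form the consumer initiates the transfer. In the PUT form the producer does, which can help on interconnects where remote writes are cheaper than remote reads. The sketch assumes a coarray-capable compiler. With gfortran, for example, `-fcoarray=single` builds a one-image version in which neither transfer executes, and a multi-image run needs a coarray runtime.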
CAF Applications and Benchmarks
• Sweep3D – wave-front parallelism
• Spark98 – sparse matrix vector multiply
• NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU
• Random Access
• STREAM
Performance Results on Cluster-based Platforms
Sweep3D 300^3
NAS SP Class C
NAS MG Class C
Performance Results on SGI Altix 3000
Sweep3D 300^3
NAS SP Class C
NAS MG Class C