
Experiences with Sweep3D
Implementations in Co-array Fortran
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
Department of Computer Science
Rice University
Houston, TX USA
Motivation
Parallel Programming Models
• MPI: de facto standard
– difficult to program
• OpenMP: inefficient to map onto distributed-memory platforms
– lack of locality control
• HPF: hard to obtain high performance
– heroic compilers needed!
• An appealing middle ground: global address space languages:
– CAF, Titanium, UPC
Evaluate CAF for an application with sophisticated
parallelization: Sweep3D
Co-Array Fortran
• Global address space programming model
– one-sided communication (GET/PUT)
• Programmer has control over performance-critical factors
– data distribution
– computation partitioning
– communication placement
• Data movement and synchronization as language primitives
– amenable to compiler-based communication optimization
CAF Programming Model Features
• SPMD process images
– fixed number of images during execution
– images operate asynchronously
• Both private and shared data
– real x(20, 20)      ! a private 20x20 array in each image
– real y(20, 20)[*]   ! a shared 20x20 co-array with an instance in each image
• Simple one-sided shared-memory communication
– x(:,j:j+2) = y(:,p:p+2)[r]
copies columns p:p+2 of y on image r into local columns j:j+2 of x
• Synchronization intrinsic functions
– sync_all – a barrier and a memory fence
– sync_mem – a memory fence
– sync_team([team members to notify], [team
members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation
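A minimal sketch pulling these features together (the image arithmetic and array sizes are illustrative, not from any particular code):

   real :: x(20,20)        ! private: an independent copy in each image
   real :: y(20,20)[*]     ! co-array: remotely accessible from any image
   integer :: r

   r = this_image() + 1
   if (r <= num_images()) then
      x(:,1:3) = y(:,1:3)[r]   ! one-sided GET of three columns from image r
   end if
   call sync_all()             ! barrier + memory fence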
One-sided Communication with Co-Arrays
integer a(10,20)[*]
[Figure: images 1..N, each holding its own a(10,20)]
if (this_image() > 1) then
   ! copy the last two columns from the left neighbor
   ! into the first two local columns
   a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
end if
Outline
• CAF programming model
• cafc
• Sweep3D implementations in CAF
• Experimental evaluation
• Conclusions
Rice Co-Array Fortran Compiler (cafc)
• First CAF multi-platform compiler
– previous compiler only for Cray shared memory systems
• Implements core of the language
– currently lacks support for derived-type and dynamic co-arrays
• Core sufficient for non-trivial codes
• Performance comparable to that of hand-tuned MPI codes
• Open source
cafc Implementation Strategy
• Goals
– portability
– high performance on a wide range of platforms
• Source-to-source compilation of CAF codes
– uses Open64/SL Fortran 90 infrastructure
– CAF → Fortran 90 + communication operations
• Communication
– ARMCI library for one-sided communication on clusters (PNNL)
– load/store communication on shared-memory platforms
Synchronization
• Original CAF specification: team synchronization only
– sync_all, sync_team
• Limits performance on loosely-coupled architectures
• Point-to-point extensions
– sync_notify(q)
– sync_wait(p)
Point-to-point synchronization semantics
Delivery of a notify from p to q guarantees that all communication
from p to q issued before the notify has also been delivered to q
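A minimal sketch of a pairwise exchange under these semantics (buf, work, n, and the image indices p and q are illustrative):

   real :: buf(n)[*]            ! co-array buffer, one instance per image

   ! on image p (producer):
   buf(1:n)[q] = work(1:n)      ! one-sided PUT into image q
   call sync_notify(q)          ! arrives at q only after the PUT data

   ! on image q (consumer):
   call sync_wait(p)            ! after this, the data PUT by p is visible
   ... use buf(1:n) ...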
CAF Compiler Targets (Oct 2004)
• Processors
– Pentium, Alpha, Itanium2, MIPS
• Interconnects
– Quadrics, Myrinet, Gigabit Ethernet, shared memory
• Operating systems
– Linux, Tru64, IRIX
Outline
• CAF programming model
• cafc
• Sweep3D implementations
– Original MPI implementation
– CAF versions
– Communication microbenchmark
• Experimental evaluation
• Conclusions
Sweep3D
• Core of an ASCI application
• Solves a
– one-group
– time-independent
– discrete ordinates (Sn)
– 3D Cartesian (XYZ) geometry
– neutron transport problem
• Deterministic particle transport accounts for 50-80% of the
execution time of many realistic DOE simulations
Sweep3D Parallelization
2D spatial domain decomposition onto a 2D processor array
Sweep3D Parallelization
Wavefront parallelism
Sweep3D Kernel Pseudocode
do iq = 1, 8
   do mo = 1, mmo
      do kk = 1, kb
         recv e/w into Phiib
         recv n/s into Phijb
         ...
         ! heavy computation with use/update
         ! of Phiib and Phijb
         ...
         send e/w Phiib
         send n/s Phijb
      enddo
   enddo
enddo
Initial Sweep3D CAF
Implementation
• Based on the MPI implementation
• Maintain original computation
• Convert communication buffers into co-arrays
• Fundamental issue: converting two-sided communication
into one-sided communication (see the sketch below)
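A sketch of this conversion for one boundary exchange (west, east, nib, tag, comm, status, and ierr are illustrative; this is not the literal Sweep3D code):

   ! two-sided (MPI): both endpoints participate explicitly
   call MPI_Recv(Phiib, nib, MPI_REAL, west, tag, comm, status, ierr)
   ! ... compute with Phiib ...
   call MPI_Send(Phiib, nib, MPI_REAL, east, tag, comm, ierr)

   ! one-sided (CAF): Phiib is a co-array; the sender writes it directly
   call sync_wait(west)               ! west has finished its PUT into Phiib
   ! ... compute with Phiib ...
   call sync_notify(west)             ! our buffer may be overwritten again
   call sync_wait(east)               ! east's buffer is free for our PUT
   Phiib(1:nib)[east] = Phiib(1:nib)  ! PUT the boundary data into east
   call sync_notify(east)             ! tell east the data has arrived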
2-sided vs 1-sided Communication
• 2-sided comm
– source issues MPI_Send; destination issues a matching MPI_Recv
• 1-sided comm
– destination: sync_notify when its buffer is free; source: sync_wait
– source: PUT, then sync_notify; destination: sync_wait before reading the data
[Animation: message timelines contrasting the two protocols]
CAF Implementation Issues
• Synchronization necessary to avoid data races might lead to
inefficiency
• Using multiple communication buffers enables overlap of
synchronization with computation
One- vs. Two-buffer Communication
• One-buffer communication: pipeline bubbles
[Figure: source and dest timelines with pipeline bubbles]
• Two-buffer communication: virtually no bubbles!
[Figure: source and dest timelines with overlapped transfers]
Asynchrony-tolerant CAF
Implementation of Sweep3D
• Multi-versioned communication buffers (see the sketch below)
• Benefits
– overlap of PUT with computation on the destination
– overlap of synchronization with computation on the source
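A sketch of how multi-versioned buffers can be cycled on the sender side (NV, succ, nib, and nsteps are illustrative; this is not cafc's actual buffer management):

   integer, parameter :: NV = 3          ! number of buffer versions
   real :: Phiib(nib, NV)[*]             ! multi-version communication buffer
   integer :: step, v

   do step = 1, nsteps
      v = mod(step - 1, NV) + 1          ! cycle through the versions
      ! only wait once version v must be reused; the first NV PUTs
      ! proceed without stalling on the successor
      if (step > NV) call sync_wait(succ)
      ! ... compute this step's boundary data into Phiib(:, v) ...
      Phiib(:, v)[succ] = Phiib(:, v)    ! PUT version v to the successor
      call sync_notify(succ)             ! successor may consume version v
   end do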
Three-buffer Communication
[Figure: each image cycles its three buffers through the roles "from
predecessor", "compute", and "to successor", so receiving, computing,
and sending proceed concurrently]
Communication Throughput
Microbenchmark
• MPI implementation: blocking send and receive
• CAF one-version buffer
• CAF multi-versioned buffers
• ARMCI implementation: one buffer
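A sketch of the shape of such a throughput test for the CAF one-version-buffer case (MSGSIZE and NITER are illustrative parameters):

   real :: buf(MSGSIZE)[*]
   integer :: i, t0, t1, rate

   call sync_all()
   call system_clock(t0, rate)
   if (this_image() == 1) then
      do i = 1, NITER
         buf(:)[2] = buf(:)       ! PUT the message to image 2
         call sync_notify(2)      ! signal arrival
         call sync_wait(2)        ! one buffer: wait for the ack before reuse
      end do
   else if (this_image() == 2) then
      do i = 1, NITER
         call sync_wait(1)        ! the message has arrived
         call sync_notify(1)      ! ack: the buffer may be overwritten
      end do
   end if
   call system_clock(t1)
   ! bytes moved = NITER * MSGSIZE * 4 (default real), over (t1-t0)/rate seconds

With multi-versioned buffers the sender's sync_wait moves off the critical path, which is the difference the microbenchmark is designed to expose.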
Outline
• CAF programming model
• cafc
• Sweep3D implementations
• Experimental evaluation
• Conclusions
Experimental Evaluation
• Platforms
– Itanium2+Quadrics QSNet II (Elan4)
– SGI Altix 3000
– Itanium2+Myrinet 2000
– Alpha+Quadrics QSNet (Elan3)
• Problem sizes
– 50x50x50
– 150x150x150
– 300x300x300
Itanium2 + Quadrics, Size 50x50x50
Itanium2 + Quadrics, Size 150x150x150
Itanium2 + Quadrics, Size 300x300x300
• multi-version buffers improve performance of CAF codes by 15%
• imperative to use non-blocking notifies
Itanium2+Quadrics, Communication
Throughput Microbenchmark
• multi-version buffers improve throughput
– by 30% for messages up to 8KB
– by 10% for messages larger than 8KB
• overhead of the CAF translation is acceptable
SGI Altix 3000, Size 50x50x50
SGI Altix 3000, Size 150x150x150
• multi-version buffers are effective for asynchrony tolerance
SGI Altix 3000, Size 300x300x300
• both CAF implementations outperform MPI
SGI Altix 3000, Communication
Throughput Microbenchmark
[Plot: warm cache]
• ARMCI library effectively exploits the hardware support
for efficient data movement
• MPI performs extra data copies
Summary of Results
• MPI buffering for small messages helps latency &
asynchrony tolerance
• CAF multi-version buffers improve performance of one-sided
communication for wavefront computations
– enable PUT and the receiver's computation to overlap
– asynchrony tolerance between sender and receiver
• Non-blocking notifies are important for performance
– enable synchronization to overlap with computation
• Platform results
– CAF outperforms MPI for large problem sizes by ~10% on
Itanium2+{Quadrics,Myrinet,Altix}
– CAF ~16% slower on Alpha+Quadrics (Elan3)
• ARMCI lacks non-blocking notifies on Elan3
Enhancing CAF Usability
• CAF vs MPI usability
– easier to use than MPI for simple parallel programs
– as difficult for carefully-tuned parallel codes
• Improving CAF ease of use
– compiler support for managing multi-version communication buffers
– vectorizing fine-grain communication to best support the X1 and cluster
platforms with the same code
http://www.hipersoft.rice.edu/caf
Implementing Communication
x(1:n) = a(1:n)[p] + …
• Use a temporary buffer to hold off-processor data
– allocate buffer
– perform GET to fill buffer
– perform computation:
x(1:n) = buffer(1:n) + …
– deallocate buffer
• Optimizations
– no temporary storage for co-array to co-array copies
– load/store communication on shared-memory systems
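A sketch of these steps as Fortran (conceptual only; in the code cafc actually generates, the GET becomes an ARMCI call on clusters and a load on shared memory):

   real, allocatable :: buffer(:)

   allocate(buffer(n))           ! temporary for the off-processor data
   buffer(1:n) = a(1:n)[p]       ! GET: fill the buffer from image p
   x(1:n) = buffer(1:n) + ...    ! compute using only local data
   deallocate(buffer)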
Detailed Results
• Itanium2+Quadrics (Elan4)
– similar for 50^3, 9% better for 150^3 and 300^3
• Alpha+Quadrics (Elan3)
– 8% better for 50^3, 16% lower for 150^3, and similar for 300^3
– ARMCI lacks non-blocking notifies on Elan3
• SGI Altix 3000
– comparable for 50^3 and 150^3, 10% better for 300^3
• Itanium2+Myrinet
– similar for 50^3, 12% better for 150^3, and 9% better for 300^3
SGI Altix 3000, Communication Throughput Microbenchmark
[Plots: warm cache and cold cache]
One- vs. Two-buffer Communication
• One-buffer communication: delays
[Figure: source and dest timelines showing delays]
• Two-buffer communication: smaller delays!
[Figure: source and dest timelines with overlapped transfers]
Asynchrony-tolerant CAF Implementation
• one comm. buffer
– each PUT is bracketed by a full sync_notify/sync_wait handshake,
so the sender stalls until the previous message is acknowledged
• two comm. buffers
– PUTs alternate between the buffers; the handshake for one buffer
overlaps computation and communication on the other
[Animation: synchronization timelines for one vs. two communication buffers]