OpenMP for Networks of SMPs – Parallel Programming ECE1747
OpenMP for Networks of SMPs
Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel
ECE1747 – Parallel Programming
Vicky Tsang
Background
Published in the Journal of Parallel and Distributed Computing, vol. 60 (12), pp. 1512-1530, December 2000
Work to further improve TreadMarks
Presents an alternative solution to MPI
Roadmap
Motivation
Solution
OpenMP API
TreadMarks
OpenMP Translator
Performance Measurement
Results
Conclusion
Motivation
To enable the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors.
To provide a standard alternative to MPI?
Solution
Presents the first system that implements OpenMP on a network of shared-memory multiprocessors
Implemented via a translator converting OpenMP directives to calls in modified TreadMarks
Modified TreadMarks uses POSIX threads for parallelism within an SMP node
Solution
Original version of TreadMarks:
A Unix process was executed on each processor of the multiprocessor node, and communication between processes was achieved through message passing
Fails to take advantage of hardware shared memory
Solution
Modified version of TreadMarks:
POSIX threads used to implement parallelism
OpenMP threads within a multiprocessor share a single address space
Positive:
Reduces the number of changes to TreadMarks to support
multithreading on a multiprocessor
OS maintains the coherence of page mappings automatically
Negative:
More difficult to provide uniform sharing of memory between
threads on the same node and threads on different nodes
OpenMP API
Three kinds of directives:
Parallelism/work sharing
Data environment
Synchronization
Based on a fork-join model
Sequential code sections executed by the master thread
Parallel code sections executed by all threads, including the master thread
OpenMP API
Parallel directive – all threads perform the same computation
Work sharing directive – computation is divided among the threads
Data environment directive – control the sharing of program variables
Synchronization directive – control the synchronization between threads
TreadMarks
User-level software distributed shared memory (SDSM) system
Provides a global shared address space on top of physically distributed memories
Key functions performed are memory coherence and synchronization
TreadMarks – Memory Coherence
Minimize the amount of communication performed to maintain memory consistency by:
a lazy implementation of release consistency
reducing the impact of false sharing by allowing multiple concurrent writers to modify a page
Propagation of consistency information is postponed until the time of an acquire
TreadMarks - Synchronization
Barrier implemented as acquire and release messages
Governed by a centralized manager
TreadMarks – Modifications for OpenMP
Inclusion of two primitives:
Tmk_fork
Tmk_join
All threads created at the start of a program's execution to minimize overhead.
Slave threads are blocked during sequential execution until the next Tmk_fork is issued by the master thread.
TreadMarks – Modifications for Networks of Multiprocessors
POSIX threads enable sharing of data between processors within a node. Some data structures, such as message buffers, were added in thread-private memory for data that is to remain private to a thread.
A per-page mutex was added to allow greater concurrency in the page fault handler.
Synchronization functions in TreadMarks were modified to use POSIX thread-based synchronization between processors within a node, and the existing TreadMarks synchronization functions between nodes.
A second mapping was added for the memory that is shared between nodes, so shared-memory pages can be updated while the first mapping remains invalid until the update is complete. This reduces the number of page protection operations performed by TreadMarks.
OpenMP Translator
Synchronization directives translate directly to TreadMarks synchronization operations.
The compiler translates the code sections marked with parallel directives to fork-join code.
Data environment directives are implemented to work with both TreadMarks and POSIX threads, hiding the interface issues from the programmer.
Performance Measurement
Platform
IBM SP2 consisting of four SMP nodes
Per node:
Four IBM PowerPC 604 processors
1 GB memory
Running AIX 4.2
Performance Measurement
Applications
SPLASH-2
Barnes-Hut
NAS 3D-FFT
SPLASH-2 CLU
SPLASH-2 Water
Red-Black SOR
TSP
Modified Gram-Schmidt (MGS)
Results
Conclusion
Enables the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors.
Using hardware shared memory reduced the data and messages transmitted.
The speedups of multithreaded TreadMarks codes on four four-way SMP SP2 nodes are within 7-30% of the MPI versions.
Critique
Solution allows easier implementation of program parallelization across multiprocessors if speedup is not crucial
OpenMP is easier on the programmer, but the speedup is still not as good as MPI's
Critique
Issues:
AIX has an inefficient implementation of page protection
Paper claims that every other brand of Unix, including Linux, uses data structures that handle mprotect operations more efficiently
Why wasn’t the solution implemented on another platform?
The paper failed to present a compelling motivation for using this solution over MPI.
Thank You