OpenMP for Networks of SMPs – Parallel Programming ECE1747


OpenMP for Networks of SMPs
Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel
ECE1747 – Parallel Programming
Vicky Tsang
Background
- Published in the Journal of Parallel and Distributed Computing, vol. 60 (12), pp. 1512-1530, December 2000
- Work to further improve TreadMarks
- Presents an alternative solution to MPI

Roadmap
- Motivation
- Solution
- OpenMP API
- TreadMarks
- OpenMP Translator
- Performance Measurement
- Results
- Conclusion
Motivation
- To enable the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors.
- To provide an alternative standard to MPI?
Solution
- Presents the first system that implements OpenMP on a network of shared-memory multiprocessors
- Implemented via a translator that converts OpenMP directives into calls to a modified TreadMarks
- The modified TreadMarks uses POSIX threads for parallelism within an SMP node
Solution

- Original version of TreadMarks:
  - A Unix process was executed on each processor of the multiprocessor node, and communication between processes was achieved through message passing
  - Fails to take advantage of hardware shared memory
Solution

- Modified version of TreadMarks:
  - POSIX threads used to implement parallelism
  - OpenMP threads within a multiprocessor share a single address space
  - Positive:
    - Reduces the number of changes to TreadMarks needed to support multithreading on a multiprocessor
    - The OS maintains the coherence of page mappings automatically
  - Negative:
    - More difficult to provide uniform sharing of memory between threads on the same node and threads on different nodes
OpenMP API

- Three kinds of directives:
  - Parallelism/work sharing
  - Data environment
  - Synchronization
- Based on a fork-join model
- Sequential code sections are executed by the master thread
- Parallel code sections are executed by all threads, including the master thread
OpenMP API
- Parallel directive – all threads perform the same computation
- Work-sharing directive – the computation is divided among the threads
- Data environment directives – control the sharing of program variables
- Synchronization directives – control the synchronization between threads
TreadMarks
- User-level software distributed shared memory (SDSM) system
- Provides a global shared address space on top of physically distributed memories
- Key functions performed are memory coherence and synchronization
TreadMarks – Memory Coherence

- Minimizes the amount of communication performed to maintain memory consistency by:
  - a lazy implementation of release consistency
  - reducing the impact of false sharing by allowing multiple concurrent writers to modify a page
- Propagation of consistency information is postponed until the time of an acquire
TreadMarks – Synchronization
- Barriers implemented as acquire and release messages
- Governed by a centralized manager
TreadMarks – Modifications for OpenMP
- Inclusion of two primitives:
  - Tmk_fork
  - Tmk_join
- All threads are created at the start of a program's execution to minimize overhead.
- Slave threads are blocked during sequential execution until the next Tmk_fork is issued by the master thread.
TreadMarks – Modifications for Networks of Multiprocessors
- POSIX threads enabled sharing of data between processors. Some data structures, such as message buffers, were placed in thread-private memory for data that is to remain private to a thread.
- A per-page mutex was added to allow greater concurrency in the page fault handler.
- Synchronization functions in TreadMarks were modified to use POSIX-thread-based synchronization between processors within a node and the existing TreadMarks synchronization functions between nodes.
- A second mapping was added for the memory that is shared between nodes, so shared-memory pages can be updated while the first mapping remains invalid until the update is complete. This reduces the number of page protection operations performed by TreadMarks.
OpenMP Translator
- Synchronization directives translate directly to TreadMarks synchronization operations.
- The compiler translates the code sections marked with parallel directives into fork-join code.
- Data environment directives are implemented to work with both TreadMarks and POSIX threads, hiding the interface issues from the programmer.
Performance Measurement
- Platform: IBM SP2 consisting of four SMP nodes
- Per node:
  - Four IBM PowerPC 604 processors
  - 1 GB of memory
  - Running AIX 4.2
Performance Measurement
- Applications:
  - SPLASH-2 Barnes-Hut
  - NAS 3D-FFT
  - SPLASH-2 CLU
  - SPLASH-2 Water
  - Red-Black SOR
  - TSP
  - Modified Gram-Schmidt (MGS)
Results
Conclusion
- Enables the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors.
- Using hardware shared memory reduced the data and the number of messages transmitted.
- The speedups of the multithreaded TreadMarks codes on four four-way SMP SP2 nodes are within 7-30% of the MPI versions.
Critique
- The solution allows easier implementation of program parallelization across multiprocessors if speedup is not crucial
- OpenMP is easier on the programmer, but the speedup is still not as good as MPI's
Critique
- Issues:
  - AIX has an inefficient implementation of page protection
    - The paper claims that every other brand of Unix, including Linux, uses data structures that handle mprotect operations more efficiently
    - Why wasn't the solution implemented on another platform?
  - The paper failed to present a compelling motivation for using this solution over MPI.
Thank You