Transcript Slide 1

COMP60621
Designing for Parallelism
Lecture 2
Parallel Programming:
Language extensions – pthreads (and MPI)
Introduction to Interference
John Gurd, Graham Riley
Centre for Novel Computing
School of Computer Science
University of Manchester
November 2012
Overview
– Parallel Programming Fundamentals
• Different levels of programming…
• Managing parallel work units and their interaction
– The Unix Model (http://www.unix.org)
• Processes and threads: memory mapping
– Overview of pthreads - single address space
– Overview of MPI - multiple address spaces
– Summary
Extensions to C
– We need specific programming constructs to define parallel
computations. We shall use (sequential) C as a starting point.
– In this lecture, we investigate extensions to C that allow the
programmer to express parallel activity in the thread-based or
data-sharing style and in the message passing style.
– We approach this from two main directions:
• Extensions that allow the programmer to create and manage
threads explicitly and interact via shared memory.
• Extensions that allow the programmer to manage processes
explicitly and exchange data via messages.
Different Levels of
Thread-based Programming
– None of these schemes is fully implicit (i.e. automatic);
unfortunately, autoparallelisation of C (or any other serial
language) is beyond the present state of the art. Instead, the
different schemes offer increasing amounts of high-level
assistance for the creation and management of parallel
threads.
– POSIX Parallel Threads Library
• 'Bare-metal' approach --- the programmer is responsible for
everything except the implementation of the POSIX calls themselves.
– OpenMP API (a higher-level alternative for threads)
• Much functionality is provided, e.g. at the loop level (a minimal
illustration follows below) --- the programmer is presented with a
simpler picture, but the scope for losing performance through
naivety increases.
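
As a taste of the loop-level assistance OpenMP provides, here is a minimal
sketch (an addition to these slides, not the lecture's own example): a single
directive asks the compiler and run-time system to create the threads and
share out the loop iterations.

#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    int i;

    /* The directive asks the OpenMP run-time to divide the loop
       iterations among a team of threads. */
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];
    }

    printf("a[0] = %f\n", a[0]);
    return 0;
}

Compiled with an OpenMP-aware compiler (e.g. gcc -fopenmp) the loop runs in
parallel; without the flag the directive is ignored and the program remains
correct sequential C.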
How to Obtain Parallel Thread-based
Activity
– The general approach to developing a parallel code is the
same in each scheme. The basic idea is to create a number of
parallel threads and then find (relatively) independent units of
work for each of them to execute.
– Units of work come in two basic types, which correspond to
task- and data-parallelism (a small sketch contrasting the two
appears at the end of this slide):
• Functionally different subprograms, each executed once;
• Single subprogram, executed multiple times – with different data
– In general, these forms of parallelism can be nested.
– Each scheme relies on run-time support routines, provided as
part of the operating system. It is important to know how
memory (address space) is laid out at run-time. An example is
given by the UNIX system, described on the next slide – this is
similar to other operating systems.
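
Before turning to the memory layout, here is a minimal sketch (an addition to
these slides) of the two kinds of work unit, using the pthread_create and
pthread_join calls introduced later in the lecture; the function names and the
number of worker threads are invented for illustration:

#include <pthread.h>
#include <stdio.h>

/* Task parallelism: functionally different subprograms, each executed once. */
void *compute_forces(void *arg) { printf("computing forces\n"); return NULL; }
void *write_output(void *arg)   { printf("writing output\n");   return NULL; }

/* Data parallelism: a single subprogram, executed many times on different data. */
void *process_chunk(void *arg)
{
    long chunk = (long)arg;            /* which part of the data to work on */
    printf("working on chunk %ld\n", chunk);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2, workers[4];
    long i;

    /* Task-parallel: two threads, two different functions. */
    pthread_create(&t1, NULL, compute_forces, NULL);
    pthread_create(&t2, NULL, write_output, NULL);

    /* Data-parallel: four threads, the same function, different data. */
    for (i = 0; i < 4; i++)
        pthread_create(&workers[i], NULL, process_chunk, (void *)i);

    /* Wait for all the work units to finish. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    for (i = 0; i < 4; i++)
        pthread_join(workers[i], NULL);
    return 0;
}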
The UNIX Model:
Processes and Threads
• There are two basic units:
– A Process, the basic unit of resource.
– A thread, the basic unit of execution.
– The simplest process is one having a single thread
of execution.
• This corresponds well to our programming models.
Code is shared by all threads in a process. The
general situation is illustrated in the following slide.
– (Note: the terminology used in other operating systems is
dangerously ambiguous.)
Memory Map for UNIX Processes and
Threads
[Diagram: the virtual address space of a UNIX process, containing OS
segments; a code segment (process-shared); task-shared data, e.g. shared
between processes via mmap(); thread-shared data; the master stack with its
program counter (PC); and a separate stack, each with its own PC, for every
additional thread.]
POSIX Threads and an example…
• An IEEE standard for UNIX-like systems (defined for C)
– Standard ‘wrappers’ exist to support use from FORTRAN and other
languages
• A set of library routines (and a run-time system) to manage the explicit
creation and destruction of threads, and to manage their interaction
• Essentially, a pthread executes a user-defined function
– Scheduling of work to threads is down to the user
• Calls to pthread synchronisation routines manage the interaction
between threads via shared data
• OpenMP implementations can be built on top of pthreads
– with details hidden from user
• See: https://computing.llnl.gov/tutorials/pthreads
– Good overview; starts with a description of the relationship between
processes and threads
Pthreads – simple example
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 5

void *PrintHello(void *threadid);   /* the function each thread will execute */

int main (int argc, char *argv[]) {
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;

    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    /* The main thread exits, but the process (and its other threads) lives on. */
    pthread_exit(NULL);
}

void *PrintHello(void *threadid)
{
    long tid = (long)threadid;
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);
}
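
A note on the design: main() finishes with pthread_exit(NULL) rather than
return, so that the process is not torn down while the worker threads are
still running. An alternative (a sketch added here, reusing the names above)
is to wait for each thread explicitly before returning:

    /* Alternative ending for main(): wait for every worker, then return. */
    for (t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
    }
    return 0;

Either way, the program is compiled and linked against the pthreads library,
e.g. with gcc's -pthread flag.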
Synchronisation mechanisms
• Needed to manage interference between threads modifying
shared data.
– Need to provide mutual exclusion
– Example to come…
• Locks and condition variables in pthreads
– Example calls are shown on the following slides
• Semaphores (provided by the OS)
– An early mechanism (Dijkstra, 1965) for controlling resource sharing in
concurrent systems (providing mutual exclusion).
– Pthread synchronisation routines are frequently implemented by
OS-supported semaphores. For example, a binary semaphore is a
lock.
– More on these later…
The Ornamental Garden Problem: Interference
People enter an ornamental garden through either of two
turnstiles. Management wish to know how many are in the
garden at any time.
[Diagram: people enter the garden through an East turnstile and a West
turnstile; both turnstiles update a shared people count.]
The concurrent program consists of two turnstile threads and a
shared 'value' variable that counts the people.
From Magee & Kramer
Ornamental garden program
The shared people count (value) and the two turnstile threads are
created by the garden program as follows:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int value = 0;                    /* shared people count */

void *turnstyle(void *arg);       /* defined on the next slide */

int main()
{
    pthread_t thread1, thread2;

    pthread_create(&thread1, NULL, &turnstyle, (void *)0);   /* East */
    pthread_create(&thread2, NULL, &turnstyle, (void *)1);   /* West */

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    printf("Number of people on exit = %d\n", value);
    exit(0);
}
Turnstyle function
#define GARDEN_MAX 20             /* arrivals per turnstile (20, as on the later slides) */

void *turnstyle(void *arg)
{
    long id = (long)arg;
    int arrive;

    for (arrive = 0; arrive < GARDEN_MAX; arrive++)
    {
        value++;                  /* unprotected update of the shared counter */
    }
    printf("Turnstyle %ld completed\n", id);
    return NULL;
}
Running this code
• Note, we do not consider whether the threads are
running on the same processor core (multitasking) or
on different cores (true parallelism).
• The point is that there is exploitation of concurrency
(multiple threads) and the behaviour of the
concurrent program is independent of its deployment
onto actual hardware.
• We want to design a correct implementation
regardless of the deployment.
Graphic of possible outcome
After the East and West turnstile threads have each
incremented the people count 20 times, the garden's people
counter may be less than the expected total of 40.
Counter increments have been lost. Why?
Magee & Kramer
Concurrent processes!
The east and west turnstyle threads may be executing the
code that increments the shared counter 'at the same time'.

[Diagram: the east and west threads, each with its own program
counter (PC), executing the shared increment code, which first
reads value and then writes value + 1.]
Without some form of locking to ensure mutual
exclusion, writes can be lost and the wrong total computed.
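
A minimal sketch (an addition to these slides) of how the update could be
protected with a pthreads mutex; the lock serialises the read-modify-write of
value so no increment is lost:

#include <pthread.h>

extern int value;                        /* the shared people count */
pthread_mutex_t value_lock = PTHREAD_MUTEX_INITIALIZER;

void enter_garden(void)
{
    pthread_mutex_lock(&value_lock);     /* only one thread past this point */
    value++;                             /* the read and the write now happen atomically */
    pthread_mutex_unlock(&value_lock);
}

With each value++ in the turnstyle loop replaced by a call like
enter_garden(), every arrival is counted and the final total is 40.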
Semaphores
• Note: when using semaphores, typical routines are:
• semaphore_wait(s), equivalent to
pthread_mutex_lock()
• semaphore_signal(s), equivalent to
pthread_mutex_unlock()
• Where s is a semaphore initialised to 1 (a binary
semaphore)
• We will see semaphores used in lab exercise 1
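
As a concrete illustration (an addition to these slides, using the POSIX
semaphore API rather than the generic names above), a binary semaphore
protects the shared counter in exactly the same way as a lock:

#include <semaphore.h>

extern int value;                 /* the shared people count */
sem_t s;                          /* a binary semaphore */

void init_counter(void)           /* call once, before the threads are created */
{
    sem_init(&s, 0, 1);           /* initial value 1: the resource is free */
}

void enter_garden(void)
{
    sem_wait(&s);                 /* semaphore_wait(s): take the 'lock' */
    value++;
    sem_post(&s);                 /* semaphore_signal(s): release it */
}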
Summary of Threads approach
– Programming multiple threads with explicit accesses to shared
data requires attention to detail and much low level control.
– This can be alleviated by providing the programmer with a
high-level data-sharing model and leaving low-level problems
to the implementation (e.g. OpenMP). Higher-level
abstractions make programming easier, but they
provide more opportunity for performance to be lost, due to
unforeseen actions by the compiler or run-time system.
– Experience shows that it is somewhat easier to program using
threads, compared to other approaches we shall study,
although it is still non-trivial.
Overview of MPI
• This is review material from COMP60611
• Processes (no shared data, so message
passing) versus threads (shared data)
• Process-based Programming Fundamentals
– Managing Processes
– Passing Messages
– The Message-Passing Interface (MPI)
• See MPI forum on web
• Summary
Parallel Computing with Multiple
Processes
• For anyone familiar with concurrent execution of processes
under a conventional uni-processor operating system, such as
Unix, the notion of parallel computing with multiple (single-threaded) processes is quite natural.
• Each process is essentially a stand-alone sequential program,
with some form of interprocess communication mechanism
provided to permit controlled interaction with other processes.
• The simplest form of interprocess communication mechanism is
via input and output files. However, this does not allow very
'rich' forms of interaction.
• Hence, more complex varieties of message-passing have
evolved, e.g.:
– UNIX pipes, sockets, MPI, Java RMI…
Message-Passing
• The process-based approach to parallel
programming is the main alternative to the thread-based
approach. Again, we use (sequential) C as a
starting point.
• We will look at extensions to C which allow the
programmer to express parallel activity in the
message-passing style.
– Extensions that allow the programmer to send and receive
messages explicitly (to exchange program data and
synchronise). We illustrate this using the Message-Passing
Interface (MPI) standard library.
– MPI can also be used from FORTRAN and C++ (research
versions for other languages, e.g. Java, are available too).
Why MPI?
• Shared memory computers tend to be limited in size (numbers
of processors) and the cost of hardware to maintain cache
coherency across an interconnect grows rapidly with system
size. So the ‘biggest’ computers do not support shared memory.
• Distributed memory systems are relatively cheap to build. They
are essentially multiple copies of ‘independent’ processors
connected together. Interconnects for these are relatively simple
and cheap (e.g. based on routers). For example:
– Networks of workstations, NoWs, using Ethernet or Myrinet
– Supercomputers with specialised router-based interconnects:
HECToR, a Cray XT4 using fast SeaStar routers – more than
22,000 cores (5664 quad-core Opterons). Upgraded in 2010.
• Most of the Top 100 computers in the world are DM systems and
MPI is the de-facto standard for writing parallel programs for
them (at least in the scientific world). See: www.top500.org.
Managing Processes
• Remember: in our UNIX view, a process is a (virtual)
address space with one or more threads (program
counter plus stack). Processes are independent!
• A key requirement is to be able to create a new
process and know its (unique) identity. With process
identities known to one another in this way, it is
feasible within any process to construct a message
and direct it specifically to some other process.
• MPI has the concept of process ‘groups’ through
communicators, e.g. MPI_COMM_WORLD.
• Finally, there needs to be a mechanism for causing a
process to ‘die’ and allow the MPI ‘group’ to ‘die’.
Passing Messages
• The fundamental requirements for passing a
message between two processes are:
– The sending process knows how to direct a message to an
appropriate receiving process.
– In MPI this is achieved explicitly by naming a process id or
through the use of a communicator (naming a group of
processes).
– There are several models of interacting and synchronising
processes in MPI. We shall keep it simple and look only at
basic sending and receiving:
• Where recvs block but sends do not (implying buffering of the
data)
• MPI also supports synchronous sends (which block until a matching
receive is posted) and asynchronous (non-blocking) sends and recvs,
completed later by polling or waiting (a small sketch follows below)
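
A minimal sketch of the non-blocking style (an addition to these slides; the
buffer size and neighbour ranks are invented for illustration): the send and
receive are started, independent computation can proceed, and the transfers
are completed with MPI_Wait (or polled with MPI_Test):

#include "mpi.h"

void exchange(float *sendbuf, float *recvbuf, int n, int rnbr, int lnbr)
{
    MPI_Request sreq, rreq;
    MPI_Status  status;

    /* Start the transfers; neither call blocks. */
    MPI_Isend(sendbuf, n, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD, &sreq);
    MPI_Irecv(recvbuf, n, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD, &rreq);

    /* ... useful computation that does not touch the buffers ... */

    /* Complete the transfers before the buffers are reused. */
    MPI_Wait(&sreq, &status);
    MPI_Wait(&rreq, &status);
}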
The MPI C Library
• Typical MPI scientific codes use processes that are identical, thus
implementing the Single-Program-Multiple-Data (SPMD) scheme.
For example:
mpiexec -n 4 a.out                ! runs a single 4-process SPMD job
• MPI also supports the Multiple-Program-Multiple-Data (MPMD)
scheme, in which different code is executed in each process:
mpiexec -n 3 a.out : -n 4 b.out : -n 6 c.out    ! an MPMD job
• Chapter 8 in Foster's book is a good source for
additional MPI information. See also, LLNL tutorials.
MPI Fundamentals
• Processes are grouped together; they are numbered within a group
using contiguous integers, starting from 0. Messages are passed using
the send (MPI_Send) and receive (MPI_Recv) library calls (many other
forms exist!).
• A message send has the general form:
MPI_Send(sbuf,icount,itype,idest,itag,icomm,ierr)
– A send may block or not, depending on the MPI implementation's use of
buffering. (MPI_Ssend is a guaranteed blocking send.)
– Programs should not assume buffering of sends! This can lead to deadlock
(see the later example).
• A message receive has the general form:
MPI_Recv(rbuf,icount,itype,isrce,itag,icomm,istat,ierr)
• The receiving process blocks until a message of the appropriate kind
becomes available. The buffer starting at rbuf has to be guaranteed to
be large enough to hold icount elements. The istat parameter shows
how many elements actually arrived, where from, etc.
• (The forms above include the trailing ierr argument of the Fortran
binding; in the C binding the error code is the function's return value,
as the examples on the following slides show.)
MPI Fundamentals
• There are four other core library functions, illustrated
in the following slides (which use C syntax).
• The following slides show the MPI basics plus the
skeleton of an n-body MPI application in the SPMD style.
#include "mpi.h"     /* compile-time constants and prototypes needed for MPI library calls */
#include <stdio.h>

/* main program */
int main (int argc, char *argv[]) {
    int ierr, np, myid;

    /* call to initialise this process - called once only per process */
    ierr = MPI_Init(&argc, &argv);

    /* find the number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* find the id (number) of this process */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* print a "Hello world" message from this process */
    printf("I am %d of %d processes!\n", myid, np);

    /* shut down this process - last thing a process should do */
    MPI_Finalize();
    return 0;
}
#include "mpi.h"                                  /* include file */
#include <stdio.h>
#include <stdlib.h>

/* user-supplied routines (not shown): */
void Initialize(float *x, float *buff, float *forces);
void update_forces(float *x, float *buff, float *forces);
void Print_forces(int myid, float *forces);

int main(int argc, char *argv[]) {                /* main program */
    int myid, np, ierr, lnbr, rnbr, i;
    float x[300], buff[300], forces[300];
    MPI_Status status;

    ierr = MPI_Init(&argc, &argv);                /* initialise */
    if (ierr != MPI_SUCCESS) {                    /* check return code */
        fprintf(stderr, "MPI initialisation error\n");
        exit(1);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &np);           /* nprocs */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);         /* my process id */
    lnbr = (myid + np - 1) % np;                  /* id of left neighbour */
    rnbr = (myid + 1) % np;                       /* id of right neighbour */

    Initialize(x, buff, forces);

    for (i = 0; i < np - 1; i++) {                /* circulate messages */
        /* Note: assumes sends do not block! What if they do? */
        MPI_Send(buff, 300, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD);
        MPI_Recv(buff, 300, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD, &status);
        update_forces(x, buff, forces);
    }

    Print_forces(myid, forces);                   /* print result */
    MPI_Finalize();                               /* shutdown */
    return 0;
}
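
The inner loop above can deadlock if MPI_Send blocks until a matching receive
is posted, because every process would then be stuck in MPI_Send with no
receive outstanding. One safe alternative (a sketch added to these slides) is
MPI_Sendrecv_replace, which lets the MPI library pair up the send and the
receive of the ring exchange:

    for (i = 0; i < np - 1; i++) {            /* circulate messages, deadlock-free */
        MPI_Sendrecv_replace(buff, 300, MPI_FLOAT,
                             rnbr, 0,         /* destination and send tag */
                             lnbr, 0,         /* source and receive tag */
                             MPI_COMM_WORLD, &status);
        update_forces(x, buff, forces);
    }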
Other MPI Facilities
• The tag parameter is used to match up an input message with a
specific expected kind. If the kind of message is immaterial,
MPI_ANY_TAG will match with anything.
• There are also constructs for: global 'barrier' synchronisation
(MPI_Barrier); transfer of data, including one-to-many 'broadcast'
(MPI_Bcast) and 'scatter' (MPI_Scatter), and many-to-one 'gather'
(MPI_Gather); and 'reduction' operators (MPI_Reduce and
MPI_Allreduce).
• A reduction has the general form:
MPI_Reduce(src,result,icnt,ityp,op,iroot,icomm,ierr)
where op is the operator, ityp is the element type, and iroot is the
number of the process that will receive the reduced result. All
processes in the group receive the same result when MPI_Allreduce
is used. (A short C example follows below.)
• There are many other features, but these are too numerous to be
studied further here.
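
A minimal C example of a reduction (an addition to these slides; the variable
names are invented): each process contributes a local value and process 0
receives the sum:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int myid;
    double local_sum, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    local_sum = (double)myid;     /* this process's contribution */

    /* Combine the local values with MPI_SUM; only rank 0 (the root)
       receives the result. MPI_Allreduce has the same form, minus the
       root argument, and leaves the result on every process. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}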
MPI – pros and cons
• MPI is the de-facto standard for programming large
supercomputers because the current trend is to build only
distributed-memory machines.
• The vast majority of current DM machines are built out of
multicore processors
– Mixed-mode programming with MPI 'outside' and pthreads (or
OpenMP) 'inside' is possible (a sketch follows below)…
• MPI forces the programmer to face up to the distributed nature
of machines – is this a good thing?
• MPI solutions tend to be more scalable than pthread (or
OpenMP) solutions
• (OpenMP is somewhat easier to use…)
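
A minimal sketch of the mixed-mode idea (an addition to these slides; the
thread count and the worker function are invented): each MPI process requests
thread support and then runs pthreads on its share of the work, with only the
main thread making MPI calls (MPI_THREAD_FUNNELED):

#include "mpi.h"
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

void *worker(void *arg)                        /* per-thread computation */
{
    long tid = (long)arg;
    printf("thread %ld working inside this MPI process\n", tid);
    return NULL;
}

int main(int argc, char *argv[]) {
    int provided, myid;
    pthread_t threads[NUM_THREADS];
    long t;

    /* Ask for FUNNELED support: threads may exist, but only the
       main thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    for (t = 0; t < NUM_THREADS; t++)          /* pthreads 'inside' */
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    /* MPI communication 'outside' the threaded region would go here. */

    MPI_Finalize();
    return 0;
}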
Summary of MPI
• Process-based programming using a library such as
MPI for explicit passing of messages requires
attention to detail and much low level description of
activities.
• Ultimately, the same underlying problems of
parallelism emerge, regardless of whether the shared
memory (e.g. pthreads or OpenMP) or distributed
memory (e.g. MPI) programming approach is used.
Summary
• Pthreads exploit parallelism by running multiple
threads within a single process
– A single address space model
• MPI exploits parallelism between processes and
supports the explicit exchange of messages between
processes
– A multiple address space model
• Note that MPI and pthreads can be nested!
– Multiple (p)threads can execute inside each (MPI) process
– An approach that appears to match the hierarchical
architecture of modern computers (i.e. multicore processors
in a distributed memory machine, e.g. HECToR)