
COMP60611
Fundamentals of Parallel
and Distributed Systems
Lecture 3
Introduction to Parallel Computers
John Gurd, Graham Riley
Centre for Novel Computing
School of Computer Science
University of Manchester
Combining the strengths of UMIST and
The Victoria University of Manchester
Overview
• We focus on the lower of the two implementation-oriented Levels of Abstraction
– The Computer Level
• Von Neumann sequential architecture
• two fundamental ways to go parallel
– shared memory
– distributed memory
• implications for programming language implementations
– Summary
Conventional View of Computer Architecture
• We start by recalling the traditional view of a state-based
computer, as first expounded by John von Neumann (1945).
• A finite word-at-a-time memory is attached to a Central
Processing Unit (CPU) (a “single core” processor). The memory
contains the fixed code and initial data defining the program to
be executed. The CPU contains a Program Counter (PC) which
points to the first instruction to be executed. The CPU follows
the instruction execution cycle, similar to the state-transition
cycle described earlier for the Program Level.
• The memory is accessed solely by requests of the following
form:
<read,address>
or
<write,data,address>
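As a minimal, purely illustrative C sketch of this interface (the names mem_read and mem_write and the word-addressed array are assumptions, not anything defined in the lecture), the two request forms can be modelled as two functions over a single memory:

  #include <stdint.h>

  #define MEM_WORDS 65536              /* illustrative memory size, in words */

  static uint32_t memory[MEM_WORDS];   /* the single, word-at-a-time memory */

  /* <read,address>: return the value held at the addressed location. */
  uint32_t mem_read(uint32_t address) {
      return memory[address];
  }

  /* <write,data,address>: store the data at the addressed location. */
  void mem_write(uint32_t data, uint32_t address) {
      memory[address] = data;
  }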
Conventional View of Computer Architecture
This arrangement is conveniently illustrated by the following diagram:
[Diagram: a CPU containing a Program Counter (PC), connected through the memory interface to the memory]
Conventional View of Computer Architecture
• A memory address is a unique, global identifier for a memory
location – the same address always accesses (the value held in)
the same location of memory. The range of possible addresses
defines the address space of the computer.
• All addresses 'reside' in one logical memory; there is therefore
only one interface between CPU and memory.
• Memory is often organised as a hierarchy, and the address
space can be virtual; i.e. two requests to the same location may
not physically access the same part of memory --- this is what
happens, for example, in systems with cache memory, or with a
disk-based paged virtual memory. Access times to a virtually addressed memory vary considerably depending on where the addressed item currently resides.
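To illustrate why access times vary, here is a rough C sketch of a direct-mapped cache sitting in front of the memory: some requests to an address are satisfied quickly from the cache, others must go on to the slower main memory. The cache organisation, sizes and names are assumptions for the sketch, and mem_read is the hypothetical memory function from the earlier sketch:

  #include <stdbool.h>
  #include <stdint.h>

  #define CACHE_LINES 256              /* illustrative direct-mapped cache size */

  typedef struct { bool valid; uint32_t tag; uint32_t data; } cache_line_t;
  static cache_line_t cache[CACHE_LINES];

  extern uint32_t mem_read(uint32_t address);   /* slow path: main memory (hypothetical) */

  /* The same address always names the same logical location, but a
   * request may be served from the cache (fast) or from main memory
   * (slow), so the access time varies. */
  uint32_t cached_read(uint32_t address) {
      uint32_t index = address % CACHE_LINES;
      uint32_t tag   = address / CACHE_LINES;
      if (cache[index].valid && cache[index].tag == tag)
          return cache[index].data;                    /* cache hit  */
      uint32_t data = mem_read(address);               /* cache miss */
      cache[index] = (cache_line_t){ true, tag, data };
      return data;
  }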
Parallel Computer Architecture
• Imagine that we have a parallel program consisting of two
threads or two processes, A and B. We need two program
counters to run them. There are two ways of arranging this:
• Time-sharing – we use a sequential computer, as described
above, and arrange that A and B are periodically allowed to use
the (single) CPU to advance their activity. When the “current”
thread or process is deselected from the CPU, its entire state is
remembered so that it can restart from the latest position when it
is reselected.
• Multiprocessor – we build a structure with two separate CPUs,
both accessing a common memory. Code A and Code B will be
executed on the two different CPUs.
• Time-sharing is slow (and hence not conducive to high
performance), but does get used in certain circumstances.
However, we shall ignore it from now on.
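As a minimal sketch (using POSIX threads in C; not code from the course), the two codes A and B each get their own flow of control, i.e. conceptually their own program counter. Whether the operating system time-shares them on a single CPU or runs them on two CPUs is outside the program's control:

  #include <pthread.h>
  #include <stdio.h>

  /* Code A and Code B each need their own flow of control
   * (conceptually, their own program counter). The OS scheduler
   * decides whether to time-share them on one CPU or use two. */
  void *code_A(void *arg) { printf("A running\n"); return NULL; }
  void *code_B(void *arg) { printf("B running\n"); return NULL; }

  int main(void) {
      pthread_t a, b;
      pthread_create(&a, NULL, code_A, NULL);
      pthread_create(&b, NULL, code_B, NULL);
      pthread_join(a, NULL);
      pthread_join(b, NULL);
      return 0;
  }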
Parallel Computer Architecture
In diagrammatic form, the multiprocessor appears as follows:
[Diagram: CPU 1 (with PC 1) and CPU 2 (with PC 2), both connected through a single memory interface to a common memory]
Parallel Computer Architecture
• The structure on the previous slide is known as a
shared memory multiprocessor, for obvious
reasons.
• The memory interface is the same as for the
sequential architecture, and both memory and
addresses retain the same properties.
– However, read accesses to memory now have to remember
which CPU they came from so that the read data can be
returned to the correct CPU.
Parallel Computer Architecture
• But access to the common memory is subject to contention,
when both CPUs try to access memory at the same time. The
greater the number of parallel CPUs, the worse this contention
problem becomes.
• A commonly used solution is to split the memory into multiple
banks which can be accessed simultaneously. This
arrangement is shown below:
[Diagram: CPUs 1..p (each with its own PC) connected through an interconnect to memory banks 1..m]
Parallel Computer Architecture
• In this arrangement, the interconnect directs each memory access to an appropriate memory bank, according to the required address. Addresses may be allocated across the memory banks in many different ways (interleaved, in blocks, etc.); two common mappings are sketched after this list.
• The interconnect could be a complex switch mechanism, with
separate paths from each CPU to each memory bank, but this is
expensive in terms of physical wiring.
• Hence, cheap interconnect schemes, such as a bus, tend to be
used. However, these limit the number of CPUs and memory
banks that can be connected together (to a maximum of around
30).
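A minimal sketch of the two address-to-bank mappings mentioned above (the bank count and block size are illustrative assumptions):

  #include <stdint.h>

  #define NUM_BANKS 8        /* illustrative number of memory banks        */
  #define BANK_SIZE 4096     /* illustrative words per bank (blocked case) */

  /* Interleaved allocation: consecutive word addresses fall in
   * consecutive banks, so sequential accesses spread across banks. */
  uint32_t bank_interleaved(uint32_t address)   { return address % NUM_BANKS; }
  uint32_t offset_interleaved(uint32_t address) { return address / NUM_BANKS; }

  /* Blocked allocation: each bank holds one contiguous block of the
   * address space. */
  uint32_t bank_blocked(uint32_t address)   { return address / BANK_SIZE; }
  uint32_t offset_blocked(uint32_t address) { return address % BANK_SIZE; }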
Parallel Computer Architecture
• Two separate things motivate the next refinement:
– Firstly, we can double the capacity of a bus by physically co-locating a CPU and a memory bank and letting them share the same bus interface.
– Secondly, we know from analysis of algorithms (and the development of programs from them) that many of the required variables are private to each thread. By placing private variables in the co-located memory, we can avoid having to access the bus in the first place (see the sketch below).
– Indeed, we don't really need to use a bus for the interconnect.
• The resulting structure has the memory physically distributed
amongst the CPUs. Each CPU-plus-memory resembles a von
Neumann computer, and the structure is called a distributed
memory multicomputer.
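A minimal POSIX-threads sketch of the private-variables point above (names and constants are illustrative): each thread accumulates into its own stack-resident, hence co-located, variable and touches the shared variable only once, so most of its accesses never need to cross the bus or interconnect.

  #include <pthread.h>
  #include <stdio.h>

  int shared_total = 0;                      /* shared: may live in a remote bank     */
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  void *worker(void *arg) {
      int private_sum = 0;                   /* private: on this thread's own stack,
                                                ideally in the co-located memory bank */
      for (int i = 0; i < 1000; i++)
          private_sum += i;

      pthread_mutex_lock(&lock);             /* only one update touches shared data   */
      shared_total += private_sum;
      pthread_mutex_unlock(&lock);
      return NULL;
  }

  int main(void) {
      pthread_t t[2];
      for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
      printf("total = %d\n", shared_total);
      return 0;
  }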
Parallel Computer Architecture
The architecture diagram for a distributed memory
multicomputer is shown below:
[Diagram: nodes 1..p, each pairing a CPU (with its own PC) with a local memory bank, connected to one another via an interconnect]
Distributed Computer Architecture
• Some distributed memory multicomputer systems
have a single address space in which the available
addresses are partitioned across the memory banks.
– These typically require special hardware support in the
interconnect.
• Others have multiple address spaces in which each
CPU is able to issue addresses only to its 'own' local
memory bank.
• Finally, interconnection networks range from very
fast, very expensive, specialised hardware to ‘the
Internet’.
Parallel Computer Architecture
• The operation of the single address space version of this architecture, known as distributed shared memory (DSM), is logically unchanged from the previous scheme (the shared memory multiprocessor).
• However, some memory accesses only need to go to the physically
attached local memory bank, while others, according to the address,
have to go through the interconnect. This leads to different access
times for different memory locations, even in the absence of contention.
• This latter property makes distributed shared memory a non-uniform
memory access (NUMA) architecture. For high performance, it is
essential to place code and data for each thread or process in readily
accessible memory banks.
• In multiple address space versions of this architecture (known as
distributed memory or DM), co-operative parallel action has to be
implemented by message-passing software (at least at the level of the
runtime system).
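A minimal MPI sketch of such explicit message passing on a DM machine, purely for illustration: rank 0's value lives in its own local memory and must be sent as a message before rank 1 can use it.

  #include <mpi.h>
  #include <stdio.h>

  /* On a distributed memory (DM) machine, rank 0 cannot simply read
   * rank 1's local memory; data must be moved by explicit messages. */
  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;                        /* lives in rank 0's local memory */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }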
Parallel Computer Architecture
• Note that cache memories can be used to solve some of the
problems raised earlier; e.g. to reduce the bus traffic in a
distributed shared memory architecture.
• Many systems have a NUMA structure, but their single address
space is virtual. This arrangement is sometimes referred to as
virtual shared memory (VSM).
• The effect of VSM can be implemented on a DM system entirely
in software, in which case it is usually called distributed virtual
shared memory (DVSM).
• Most very large systems today consist of many shared memory,
multicore nodes, connected via some form of interconnect.
The Advent of Multicore
• A modern multicore processor is essentially a NUMA
shared memory multiprocessor on a chip.
• Consider a recent offering from AMD, the Opteron
quad-core processor:
– the next slides show a schematic of a single quad-core
processor and a shared memory system consisting of four
quad-core processors, i.e. a “quad-quad-core” system, with
a total of 16 cores.
• The number of cores per processor chip is rising
rapidly (to keep up with “Moore’s law”).
• Large systems connect thousands of multi-core
processors.
Processor: Quad-Core AMD Opteron
Source: www.amd.com, Quad-Core AMD Opteron Product Brief
AMD Opteron 4P server architecture
Source: www.amd.com, AMD 4P Server and Workstation Comparison
Summary
• The transition from a sequential computer architecture to a
parallel computer architecture can be made in three distinct
ways:
– shared memory multiprocessor
– distributed shared memory multiprocessor
– distributed memory multicomputer
• Nothing prevents more than one of these forms of architecture
being used together. They are simply different ways of
introducing parallelism at the hardware level.
• Modern large systems are hybrids with shared memory structure
at low level and distributed memory at high level.
From Program to Computer
The final part of this jigsaw is to know how parallel programs get
executed in practical parallel and distributed computers. Recall the
nature of the parallel programming constructs introduced earlier; we
consider their implementation, in both the run-time software library
and the underlying hardware.
Program Level:        Message-passing                     Data-sharing
Run-time Library:     Process support                     Thread support
Underlying Hardware:  Distributed Memory Multicomputer    Shared Memory Multiprocessor
From Program to Computer
• It is perhaps tempting to think that message-passing somehow
‘belongs’ to distributed memory architecture, and data-sharing
‘belongs’ to shared memory architecture (in other words, that
programs in the programming model ‘map’ naturally and
efficiently to the ‘belonging’ hardware).
• But this is not necessarily the case. Either kind of programming
model may be (and has been) implemented on either kind of
architecture. The two derivations (from sequential to parallel, in
software and in hardware) are completely independent of one
another.
• The key issue in practice is how much ‘overhead’ is introduced
by the implementation of each parallel programming construct
on a particular parallel architecture.
The Story So Far:
Application-Oriented View
• Solving a computational problem involves design and
development at several distinct Levels of Abstraction. The
totality of issues to be considered well exceeds the capabilities
of a normal human being.
• At Application Level, the description of the problem to be solved
is informal. A primary task in developing a solution is to create a
formal (mathematical) application model, or specification.
Although formal and abstract, an application model implies
computational work that must be done and so ultimately
determines the performance that can be achieved in an
implementation.
• Algorithms are procedures, based on discrete data domains, for
solving (approximations to) computational problems. An
algorithm is also abstract, although it is generally more clearly
related to the computer that will implement it than is the
corresponding specification.
The Story So Far:
Implementation-Oriented View
• Concrete implementation of an algorithm is achieved through the
medium of a program, which determines how the discrete data domains
inherent in the algorithm will be laid out in the memory of the executing
computer, and also defines the operations that will be performed on
that data, and their relative execution order. Interest is currently
focused on parallel execution using parallel programming languages,
based on multiple active processes (with message-passing) or multiple
threads (with data-sharing).
• Performance is ultimately dictated by the available parallel platform, via
the efficiency of its support for processes or threads. Hardware
architectures are still evolving, but a clear trend is emerging towards
distributed memory structure, with various levels of support for sharing
data at the Program Level.
• Correctness requires an algorithm which is correct with respect to the
specification, AND a correct implementation of the algorithm (as well as
the correct operation of computer hardware and network infrastructure).