Transcript Lecture 3
Parallel Scientific Computing: Algorithms and Tools
Lecture #3
APMA 2821A, Spring 2008
Instructors: George Em Karniadakis, Leopold Grinberg
Levels of Parallelism
Job-level parallelism: capacity computing
Goal: run as many jobs as possible on a system in a given time period. Concerned with throughput; an individual user's jobs may not run faster.
Of interest to administrators.
Program/task-level parallelism: capability computing
Use multiple processors to solve a single problem.
Controlled by users.
Instruction-level parallelism:
Pipelining, multiple functional units, multiple cores.
Invisible to users.
Bit-level parallelism:
Of concern to hardware designers of arithmetic-logic units.
Granularity of Parallel Tasks
Large/coarse-grain parallelism:
The amount of work that runs in parallel is fairly large,
e.g., on the order of an entire program.
Small/fine-grain parallelism:
The amount of work that runs in parallel is relatively small,
e.g., on the order of a single loop.
Coarse/large grains usually result in more favorable parallel performance.
Flynn’s Taxonomy of Computers
SISD: single instruction stream, single data stream
MISD: multiple instruction streams, single data stream
SIMD: single instruction stream, multiple data streams
MIMD: multiple instruction streams, multiple data streams
Classification of Computers
SISD: single instruction, single data
Conventional computers.
The CPU fetches from one instruction stream and works on one data stream.
Instructions may run in parallel (superscalar).
MISD: multiple instruction, single data
No real-world implementation.
Classification of Computers
SIMD: single instruction, multiple data
Controller + processing elements (PEs)
The controller dispatches an instruction to the PEs; all PEs execute the same instruction, but on different data.
e.g., MasPar MP-1, Thinking Machines CM-1, vector computers (?)
MIMD: multiple instruction, multiple data
Processors execute their own instructions on different data streams.
Processors communicate with one another directly, or through shared memory.
Typical parallel computers, clusters of workstations.
Flynn’s Taxonomy
Programming Model
SPMD: single program, multiple data
MPMD: multiple programs, multiple data
Programming Model
SPMD: single program, multiple data
The usual parallel programming model.
All processors execute the same program, on multiple data sets (domain decomposition).
Each processor knows its own ID:
• if (my_cpu_id == N) { ... }
• else { ... }
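A minimal SPMD sketch in C with MPI (an illustration added here, not from the original slides; the problem size N and the even split across ranks are assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_cpu_id, ncpus;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_cpu_id);   /* each processor knows its own ID */
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    const int N = 1000000;                       /* assumed global problem size        */
    int chunk = N / ncpus;                       /* assume N is divisible by ncpus     */
    int start = my_cpu_id * chunk;               /* this rank's piece of the data      */

    double local_sum = 0.0;
    for (int i = start; i < start + chunk; i++)  /* same program, different data       */
        local_sum += (double)i;

    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (my_cpu_id == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}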
Programming Model
MPMD: multiple programs, multiple data
Different processors execute different programs, on different data.
Usually a master-slave model is used.
• The master CPU spawns and dispatches computations to slave CPUs running a different program.
Can be converted into an SPMD model:
• if (my_cpu_id == 0) run function_containing_program_1;
• else run function_containing_program_2;
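A hedged sketch of the MPMD-to-SPMD conversion above, in C with MPI; the function names master_program and worker_program are placeholders for the two original programs:

#include <mpi.h>

static void master_program(int nworkers) { /* dispatch work to workers, collect results */ }
static void worker_program(int my_id)    { /* receive work from rank 0, compute, reply  */ }

int main(int argc, char **argv)
{
    int my_cpu_id, ncpus;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_cpu_id);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    if (my_cpu_id == 0)
        master_program(ncpus - 1);   /* "program 1": the master */
    else
        worker_program(my_cpu_id);   /* "program 2": a worker   */

    MPI_Finalize();
    return 0;
}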
Classification of Parallel Computers
Flynn’s MIMD class contains a wide variety of parallel computers.
Classification based on memory organization (address space):
Shared-memory parallel computers
• Processors can access all of the memory.
Distributed-memory parallel computers
• Each processor can access only its local memory.
• Remote memory is accessed through explicit communication.
Shared-Memory Parallel Computer
Superscalar processors with L2 caches connected to memory modules through a bus or crossbar.
All processors have access to all machine resources, including memory and I/O devices.
SMP (symmetric multiprocessor): the processors are all the same and have equal access to machine resources, i.e., the machine is symmetric.
SMPs are UMA (uniform memory access) machines.
e.g., a node of an IBM SP machine; Sun Ultra Enterprise 10000
Figure: prototype shared-memory parallel computer (P – processor; C – cache; M – memory): processors P1…Pn, each with a cache, connected to memory modules M1…Mn through a bus or crossbar.
Shared-Memory Parallel Computer
If a bus is used:
Only one processor can access the memory at a time.
Processors contend for the bus to access memory.
Figure: processors P1…Pn with caches connected to the memory modules M1…Mn by a bus.
If a crossbar is used:
Multiple processors can access memory through independent paths.
Contention occurs when different processors access the same memory module.
A crossbar can be very expensive.
Figure: processors P1…Pn with caches connected to the memory modules M1…Mn by a crossbar.
Processor count is limited by memory contention and bandwidth; the maximum is usually 64 or 128.
Shared-Memory Parallel Computer
Data flows from memory to cache, and then to the processors.
Performance depends dramatically on the reuse of data in cache.
Fetching data from memory, with potential memory contention, can be expensive.
The L2 cache plays the role of local fast memory; shared memory is analogous to extended memory accessed in blocks.
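As an illustration of cache reuse (an added example, not from the slides): C stores 2D arrays row by row, so traversing a matrix in row order reuses each cache line, while column order touches a new line on almost every access and is typically much slower.

#define N 1024
static double a[N][N];

double sum_row_major(void)          /* cache friendly: consecutive addresses */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_col_major(void)          /* cache unfriendly: stride of N doubles */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}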
Cache Coherency
If a piece of data in one processor’s cache is modified, then all other processors’ caches that contain that data must be updated.
Cache coherency: the state achieved by maintaining consistent values of the same data in all processors’ caches.
Usually hardware maintains cache coherency; system software can also do this, but it is more difficult.
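A small sketch of the cost of coherency traffic (an added illustration assuming OpenMP and 64-byte cache lines, not from the slides): when threads repeatedly write different variables that happen to share a cache line ("false sharing"), the line bounces between caches; padding each counter onto its own line avoids this.

#include <omp.h>
#define NTHREADS 4

long packed[NTHREADS];                          /* counters share cache lines          */
struct padded { long value; char pad[56]; };    /* 64 bytes: one cache line per counter */
struct padded separate[NTHREADS];

void update_counters(long niter)
{
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < niter; i++) {
            packed[t]++;                        /* invalidates the line in other caches */
            separate[t].value++;                /* stays in this thread's cache         */
        }
    }
}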
Programming Shared-Memory Parallel Computers
All memory modules share a single global address space.
Closest to a single-processor computer; relatively easy to program.
Multi-threaded programming:
Auto-parallelizing compilers can extract fine-grain (loop-level) parallelism automatically;
or use OpenMP;
or use explicit POSIX (Portable Operating System Interface) threads or other thread libraries.
Message passing:
MPI (Message Passing Interface).
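A minimal OpenMP sketch of the loop-level, multi-threaded style mentioned above (an added example, not from the slides): the directive splits the loop iterations among threads, which all see the same shared arrays.

#include <omp.h>

void axpy(int n, double alpha, const double *x, double *y)
{
    #pragma omp parallel for           /* iterations divided among the threads */
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}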
Distributed-Memory Parallel Computer
Superscalar processors with local memory, connected through a communication network.
Each processor can only work on data in its local memory.
Access to remote memory requires explicit communication.
Present-day large supercomputers are all some sort of distributed-memory machine.
Figure: prototype distributed-memory computer: processors P1…Pn, each with its own memory M, connected by a communication network.
e.g., IBM SP, BlueGene, Cray XT3/XT4
Distributed-Memory Parallel Computer
High scalability:
No memory contention like that in shared-memory machines.
Now scaled to more than 100,000 processors.
Performance of the network connection is crucial to application performance.
Ideal: low latency, high bandwidth.
Communication is much slower than local memory reads/writes.
Data locality is important: keep frequently used data in local memory.
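As a rough model (an added illustration with assumed numbers, not from the slides), the time to send an n-byte message is about T(n) = latency + n / bandwidth. With, say, 5 microseconds of latency and 1 GB/s of bandwidth, an 8-byte message costs about 5 microseconds (latency dominated), while an 8 MB message costs about 8 milliseconds (bandwidth dominated); a local memory access takes on the order of 100 nanoseconds, which is why data locality matters.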
Programming Distributed-Memory Parallel Computers
“Owner computes” rule:
The problem needs to be broken up into independent tasks with independent memory.
Each task is assigned to a processor.
Naturally matches a data-based decomposition such as a domain decomposition.
Message passing: tasks explicitly exchange data by message passing.
All data are transferred using explicit send/receive instructions.
The user must optimize communications.
Usually MPI (formerly PVM): portable, high performance.
Parallelization is mostly at a large granularity level, controlled by the user.
Difficult for compilers/auto-parallelization tools.
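A hedged sketch of explicit send/receive in C with MPI (the buffer size and values are assumptions, not from the slides): rank 0 owns the data and sends it; rank 1 posts a matching receive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double buf[100] = {0.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf[0] = 3.14;                                          /* data owned by rank 0 */
        MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* explicit send        */
    } else if (rank == 1) {
        MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* explicit receive     */
        printf("rank 1 received %f\n", buf[0]);
    }

    MPI_Finalize();
    return 0;
}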
Programming Distributed-Memory Parallel Computers
A global address space is provided on some distributed-memory machines.
Memory is physically distributed, but globally addressable; the machine can be treated as a “shared-memory” machine; so-called distributed shared memory.
e.g., Cray T3E; SGI Altix, Origin.
Multi-threaded programs (OpenMP, POSIX threads) can also be used on such machines.
The user accesses remote memory as if it were local; the OS/compilers translate such accesses into fetches/stores over the communication network.
But it is difficult to control data locality, so performance may suffer.
NUMA (non-uniform memory access); ccNUMA (cache-coherent non-uniform memory access); remote accesses add overhead.
Hybrid Parallel Computer
Overall distributed memory, with SMP nodes.
Most modern supercomputers and workstation clusters are of this type.
Programmed with message passing, or with hybrid message passing/threading.
Figure: hybrid parallel computer: SMP nodes (processors P sharing memory M through a bus or crossbar) connected by a communication network.
e.g., IBM SP, Cray XT3
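A hedged sketch of the hybrid style in C (an added example, not from the slides): one MPI process per SMP node, OpenMP threads inside the node; the loop bound is an arbitrary assumption.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided); /* request thread support   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)              /* threads within the node  */
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0;

    double total;                                                  /* messages between nodes   */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}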
Interconnection Network/Topology
Examples: ring; fully connected network.
Nodes, links.
Neighbors: nodes with a link between them.
Degree of a node: the number of neighbors it has.
Scalability: the increase in complexity when more nodes are added.
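For example (standard counts, added here for illustration): in a ring of N nodes every node has degree 2 and there are N links in total, so adding a node adds only one link; in a fully connected network of N nodes every node has degree N-1 and there are N(N-1)/2 links, so the wiring grows quadratically and the topology scales poorly.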
Topology
Hypercube
Topology
3D mesh/torus
1D/2D mesh/torus
Topology
Tree
Star
Topology
Bisection width: the minimum number of links that must be cut in order to divide the topology into two independent networks of the same size (plus or minus one node).
Bisection bandwidth: the total communication bandwidth across the links that are cut in defining the bisection width.
A larger bisection bandwidth is better.
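For example (standard values, added here for illustration): a ring of N nodes has bisection width 2; an n x n 2D mesh has bisection width n; a d-dimensional hypercube with 2^d nodes has bisection width 2^(d-1); a fully connected network with an even number N of nodes has bisection width N^2/4. The bisection bandwidth is the sum of the bandwidths of the links in such a cut.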