PPP – Lecture 1
Parallel Programming
Sathish S. Vadhiyar
Motivations of Parallel
Computing
Parallel Machine: a computer system with more than one
processor
Motivations
• Faster Execution time due to non-dependencies between
regions of code
• Presents a level of modularity
• Resource constraints. Large databases.
• Certain classes of algorithms lend themselves naturally to parallel implementation
• Aggregate bandwidth to memory/disk. Increase in data
throughput.
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
Parallel Programming and
Challenges
Recall the advantages and motivation of
parallelism
But parallel programs incur overheads
not seen in sequential programs
Communication delay
Idling
Synchronization
Challenges
[Figure: execution timeline for two processes P0 and P1, showing computation, communication, synchronization and idle time]
How do we evaluate a parallel
program?
Execution time, Tp
Speedup, S
S(p, n) = T(1, n) / T(p, n)
Usually, S(p, n) < p
Sometimes S(p, n) > p (superlinear speedup)
Efficiency, E
E(p, n) = S(p, n)/p
Usually, E(p, n) < 1
Sometimes, greater than 1
Scalability – how performance changes in relation to n and p; limitations in parallel computing
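As a small, hedged illustration of these metrics (the timing numbers below are made-up examples, not measurements of any real program), speedup and efficiency can be computed directly from measured execution times:

#include <stdio.h>

/* Speedup S(p,n) = T(1,n) / T(p,n); efficiency E(p,n) = S(p,n) / p. */
int main(void) {
    double t1 = 120.0;   /* example sequential time T(1, n), in seconds */
    double tp = 18.5;    /* example parallel time T(p, n), in seconds   */
    int    p  = 8;       /* number of processors used                   */

    double speedup    = t1 / tp;
    double efficiency = speedup / p;

    printf("S(%d, n) = %.2f, E(%d, n) = %.2f\n", p, speedup, p, efficiency);
    return 0;
}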
Speedups and efficiency
[Figure: speedup S and efficiency E versus number of processors p, ideal vs. practical curves]
Limitations on speedup –
Amdahl’s law
Amdahl's law states that the performance
improvement to be gained from using some
faster mode of execution is limited by the
fraction of the time the faster mode can be
used.
Overall speedup is expressed in terms of the fractions of computation time that can and cannot use the enhancement, and the speedup of the enhanced portion.
Places a limit on the speedup due to parallelism.
Speedup = 1 / (fs + fp/P), where fs is the serial fraction, fp = 1 - fs is the parallelizable fraction, and P is the number of processors
Amdahl’s law Illustration
S = 1 / (s + (1-s)/p)
[Figure: efficiency (0 to 1) versus number of processors (0 to 15) for the above formula, for several values of the serial fraction s]
Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html, http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
Amdahl’s law analysis
f       P=1    P=4    P=8    P=16    P=32
1.00    1.0    4.00   8.00   16.00   32.00
0.99    1.0    3.88   7.48   13.91   24.43
0.98    1.0    3.77   7.02   12.31   19.75
0.96    1.0    3.57   6.25   10.00   14.29
• For the same fraction f, the speedup falls increasingly short of the processor count as P grows.
• Thus Amdahl's law is a bit depressing for parallel programming.
• In practice, the parallel portion of the work has to be large enough to match a given number of processors.
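The table above can be reproduced from the formula itself; the following minimal C sketch evaluates S = 1 / ((1 - f) + f/P), where f is taken to be the parallel fraction, for the fractions and processor counts shown:

#include <stdio.h>

/* Amdahl's law with f as the parallel fraction: S = 1 / ((1 - f) + f / P). */
static double amdahl(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    double fractions[] = { 1.00, 0.99, 0.98, 0.96 };
    int    procs[]     = { 1, 4, 8, 16, 32 };

    printf("%-6s", "f");
    for (int j = 0; j < 5; j++) printf("  P=%-5d", procs[j]);
    printf("\n");
    for (int i = 0; i < 4; i++) {
        printf("%-6.2f", fractions[i]);
        for (int j = 0; j < 5; j++) printf("  %-7.2f", amdahl(fractions[i], procs[j]));
        printf("\n");
    }
    return 0;
}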
Gustafson’s Law
Amdahl’s law – keeps the total work (problem size) fixed
Gustafson’s law – keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel vs. sequential work) to match that computation time
For a particular number of processors, find
the problem size for which parallel time is
equal to the constant time
For that problem size, find the sequential
time and the corresponding speedup
The resulting speedup is therefore called scaled speedup
Also called weak speedup (weak scaling)
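In symbols, a common statement of the scaled speedup (assuming s is the fraction of the time on the parallel machine spent in the sequential part) is:

S_{\text{scaled}}(P) = \frac{s + P\,(1 - s)}{s + (1 - s)} = s + P\,(1 - s) = P - (P - 1)\,s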
Metrics (Contd..)
Table 5.1: Efficiency as a function of n and p.
N      P=1    P=4    P=8    P=16    P=32
64     1.0    0.80   0.57   0.33
192    1.0    0.92   0.80   0.60
512    1.0    0.97   0.91   0.80
Scalability
Efficiency decreases with increasing P;
increases with increasing N
How effectively the parallel algorithm can use
an increasing number of processors
How the amount of computation performed
must scale with P to keep E constant
This function of computation in terms of P is
called isoefficiency function.
An algorithm with an isoefficiency function of
O(P) is highly scalable while an algorithm with
quadratic or exponential isoefficiency function
is poorly scalable
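One standard way to make this precise (following the formulation in the book by Grama et al., with W the problem size in units of serial work and T_o(W, p) the total overhead across all processors) is:

T_P = \frac{W + T_o(W, p)}{p}, \qquad
S = \frac{W}{T_P} = \frac{W\,p}{W + T_o(W, p)}, \qquad
E = \frac{S}{p} = \frac{1}{1 + T_o(W, p)/W}

\text{Holding } E \text{ constant gives } W = \frac{E}{1 - E}\, T_o(W, p) = K\, T_o(W, p), \text{ the isoefficiency function.}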
Parallel Program Models
Single Program
Multiple Data (SPMD)
Multiple Program
Multiple Data
(MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Programming Paradigms
Shared memory model – Threads, OpenMP,
CUDA
Message passing model – MPI
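As a minimal, hedged sketch of the shared memory model, the following OpenMP fragment in C lets all threads share the array a and combine their partial sums with a reduction; a message passing version of the same computation would instead distribute a across MPI processes and combine the partial sums with, for example, MPI_Reduce. The array size and contents are illustrative.

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];     /* shared by all threads */
    double sum = 0.0;

    /* Each thread works on part of the iteration space; reduction(+:sum)
       combines the per-thread partial sums at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}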
PARALLELIZATION
Parallelizing a Program
Given a sequential program/algorithm, how to
go about producing a parallel version
Four steps in program parallelization
1. Decomposition
Identifying parallel tasks with large extent of
possible concurrent activity; splitting the problem
into tasks
2. Assignment
Grouping the tasks into processes with best load
balancing
3. Orchestration
Reducing synchronization and communication costs
4. Mapping
Mapping of processes to processors (if possible)
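A hedged C/MPI sketch tying the four steps to a concrete computation (a global sum over an array): the decomposition is per-element, the assignment is a contiguous block of elements per process, the orchestration is a single reduction, and the mapping of processes to processors is left to the MPI runtime or job scheduler. Sizes and values are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Decomposition: N independent element contributions.
       Assignment: a contiguous block of about N/nprocs elements per process. */
    const long N = 1000000;
    long chunk = N / nprocs;
    long lo = rank * chunk;
    long hi = (rank == nprocs - 1) ? N : lo + chunk;

    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += 0.5 * i;          /* illustrative per-element work */

    /* Orchestration: one collective combines the partial sums. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Mapping: which processor runs each rank is decided by the MPI runtime. */
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}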
Steps in Creating a Parallel
Program
[Figure: the sequential computation is partitioned into tasks (Decomposition), tasks are grouped into processes p0-p3 (Assignment), the processes are organized into a parallel program (Orchestration), and the processes are mapped onto processors P0-P3 (Mapping)]
Orchestration
Goals
Structuring communication
Synchronization
Challenges
Organizing data structures – packing
Small or large messages?
How to organize communication and synchronization?
Orchestration
Maximizing data locality
Minimizing volume of data exchange
Not communicating intermediate results – e.g. dot product
Minimizing frequency of interactions - packing
Minimizing contention and hot spots
Avoid having all processes use the same communication pattern with the same partner processes at the same time
Overlapping computations with interactions
Split computations into phases: those that depend on
communicated data (type 1) and those that do not (type
2)
Initiate the communication needed for type 1; while it is in progress, perform type 2 (see the MPI sketch after this list)
Replicating data or computations
Balancing the extra computation or storage cost against the gain due to less communication
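A hedged C/MPI sketch of the overlap idea above: the halo data needed by the type 1 (boundary) computation is requested with nonblocking calls, the independent type 2 (interior) work proceeds while the transfer is in flight, and the wait is issued only when the communicated data is actually needed. The buffer names, sizes and neighbor ranks are illustrative assumptions.

#include <mpi.h>

/* Overlap communication with independent computation (illustrative sketch). */
void exchange_and_compute(double *halo_in, double *halo_out, int n,
                          int left, int right, double *interior, int m)
{
    MPI_Request reqs[2];

    /* Start the exchange needed by the boundary (type 1) computation. */
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Type 2: interior work that does not depend on the incoming halo. */
    for (int i = 0; i < m; i++)
        interior[i] = 0.25 * (interior[i] + 3.0);

    /* Wait only when the communicated data is about to be used. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Type 1: boundary work that reads halo_in would follow here. */
}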
Mapping
Which process runs on which particular
processor?
Can depend on network topology, communication
pattern of processors
On processor speeds in case of heterogeneous
systems
Mapping
Static mapping
Mapping based on data partitioning
Block distribution, e.g., 0 0 0 1 1 1 2 2 2
Block-cyclic distribution, e.g., 0 1 2 0 1 2 0 1 2
Applicable to dense matrix computations
Graph partitioning based mapping
Applicable to sparse matrix computations
Mapping based on task partitioning
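A hedged C sketch of the two data-partitioning schemes above: for element i of an n-element array distributed over p processes, block distribution assigns contiguous chunks while a cyclic distribution (block size 1) deals elements out round-robin; with n = 9 and p = 3 the output reproduces the 0 0 0 1 1 1 2 2 2 and 0 1 2 0 1 2 0 1 2 patterns shown above. The helper function names are illustrative.

#include <stdio.h>

/* Owner of element i under a block distribution of n elements over p
   processes (assumes p divides n, for simplicity). */
static int block_owner(int i, int n, int p)  { return i / (n / p); }

/* Owner of element i under a cyclic (block size 1) distribution. */
static int cyclic_owner(int i, int p)        { return i % p; }

int main(void) {
    int n = 9, p = 3;
    printf("block : ");
    for (int i = 0; i < n; i++) printf("%d ", block_owner(i, n, p));
    printf("\ncyclic: ");
    for (int i = 0; i < n; i++) printf("%d ", cyclic_owner(i, p));
    printf("\n");
    return 0;
}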
Based on Task Partitioning
Based on task dependency graph
[Figure: a binary tree task dependency graph with the tasks assigned to processes 0-7]
In general, the mapping problem is NP-complete
Mapping
Dynamic Mapping
A process/global memory can hold a set of
tasks
Distribute some tasks to all processes
Once a process completes its tasks, it asks
the coordinator process for more tasks
Referred to as self-scheduling, work-
stealing
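A hedged sketch of the self-scheduling idea in C with OpenMP: the task pool is just a shared counter, and each thread atomically claims the next unclaimed task whenever it finishes one; in a distributed memory setting the counter would instead live with a coordinator process that answers task requests. The task count and the do_task body are placeholders.

#include <stdio.h>
#include <omp.h>

#define NTASKS 100

static void do_task(int t) { (void)t; /* placeholder for real work */ }

int main(void) {
    int next = 0;    /* shared pool: index of the next unclaimed task */

    #pragma omp parallel
    {
        for (;;) {
            int mine;
            /* Atomically claim the next task (self-scheduling). */
            #pragma omp atomic capture
            mine = next++;

            if (mine >= NTASKS) break;
            do_task(mine);
        }
    }
    printf("all %d tasks completed\n", NTASKS);
    return 0;
}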
High-level Goals
Table 2.1: Steps in the Parallelization Process and Their Goals
Decomposition (mostly architecture-independent): expose enough concurrency, but not too much
Assignment (mostly architecture-independent): balance workload; reduce communication volume
Orchestration (architecture-dependent): reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping (architecture-dependent): put related processes on the same processor if necessary; exploit locality in network topology
PARALLEL ARCHITECTURE
Classification of Architectures –
Flynn’s classification
In terms of parallelism in
instruction and data stream
Single Instruction Single
Data (SISD): Serial
Computers
Single Instruction Multiple
Data (SIMD)
- Vector processors and
processor arrays
- Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures –
Flynn’s classification
Multiple Instruction
Single Data (MISD): Not
popular
Multiple Instruction
Multiple Data (MIMD)
- Most popular
- IBM SP and most other
supercomputers,
clusters, computational
Grids etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures –
Based on Memory
Shared memory
2 types – UMA and
NUMA
[Figure: UMA and NUMA shared memory organizations]
Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Shared Memory vs Message
Passing
Shared memory machine: The n
processors share physical address space
Communication can be done through this
shared memory
[Figure: left, a shared memory machine – processors P connected through an interconnect to a shared main memory; right, a distributed memory machine – each processor P has its own memory M and the nodes are connected by an interconnect]
The alternative is sometimes referred to
as a message passing machine or a
distributed memory machine
Shared Memory Machines
The shared memory could itself be
distributed among the processor nodes
Each processor might have some portion of
the shared physical address space that is
physically close to it and therefore
accessible in less time
Terms: NUMA vs UMA architecture
Non-Uniform Memory Access
Uniform Memory Access
Classification of Architectures –
Based on Memory
Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Recently multi-cores
Yet another classification – MPPs,
NOW (Berkeley), COW,
Computational Grids
Parallel Architecture:
Interconnection Networks
An interconnection network is defined by switches, links and interfaces
Switches – provide mapping between input and output ports, buffering, routing etc.
Interfaces – connect the nodes to the network
Parallel Architecture:
Interconnections
Indirect interconnects: nodes are connected to
interconnection medium, not directly to each other
Shared bus, multiple bus, crossbar, MIN
Direct interconnects: nodes are connected directly
to each other
Topology: linear, ring, star, mesh, torus, hypercube
Routing techniques: how the route taken by the message
from source to destination is decided
Network topologies
Static – point-to-point communication links among
processing nodes
Dynamic – Communication links are formed dynamically by
switches
Interconnection Networks
Static
Bus – SGI challenge
Completely connected
Star
Linear array, Ring (1-D torus)
Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
k-d mesh: d dimensions with k nodes in each dimension
Hypercubes – a (log p)-dimensional mesh with 2 nodes per dimension – e.g. many MIMD machines
Trees – our campus network
Dynamic – Communication links are formed dynamically by
switches
Crossbar – Cray X series – non-blocking network
Multistage – SP2 – blocking network.
For more details, and evaluation of topologies, refer to book by
Grama et al.
Indirect Interconnects
[Figure: indirect interconnects – shared bus, multiple bus, 2x2 crossbar, crossbar switch, multistage interconnection network]
Direct Interconnect Topologies
[Figure: star, linear array, ring, 2-D mesh, torus, and hypercube (binary n-cube, n = 2, 3) topologies]
Evaluating Interconnection
topologies
Diameter – maximum distance between any two processing nodes
Completely-connected network – 1
Star – 2
Ring – p/2
Hypercube – log p
Connectivity – multiplicity of paths between two nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
Linear array – 1
Ring – 2
2-D mesh – 2
2-D mesh with wraparound – 4
d-dimensional hypercube – d
Evaluating Interconnection
topologies
Bisection width – minimum number of links to be removed from the network to partition it into two equal halves
Ring – 2
P-node 2-D mesh – sqrt(P)
Tree – 1
Star – 1
Completely connected – P^2/4
Hypercube – P/2
Evaluating Interconnection
topologies
channel width – number of bits that can be
simultaneously communicated over a link, i.e.
number of physical wires between 2 nodes
channel rate – performance of a single physical
wire
channel bandwidth – channel rate times channel
width
bisection bandwidth – minimum volume of communication allowed between the two halves of the network, i.e. bisection width times channel bandwidth
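As a small illustration of these definitions, here is a hedged C sketch that combines them for a d-dimensional hypercube (diameter d, bisection width p/2 for p = 2^d nodes); the channel width and channel rate are made-up example values.

#include <stdio.h>

int main(void) {
    int  d = 4;                 /* hypercube dimension (example)       */
    long p = 1L << d;           /* number of nodes, p = 2^d            */
    long diameter        = d;   /* maximum distance between two nodes  */
    long bisection_width = p / 2;

    double channel_width = 32;      /* bits per link (example value)      */
    double channel_rate  = 1e9;     /* bits per second per wire (example) */
    double channel_bw    = channel_width * channel_rate;   /* bits/s per link */
    double bisection_bw  = bisection_width * channel_bw;

    printf("p=%ld  diameter=%ld  bisection width=%ld  bisection bandwidth=%.3g bits/s\n",
           p, diameter, bisection_width, bisection_bw);
    return 0;
}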
Shared Memory Architecture:
Caches
[Figure: P1 and P2 both read X (initially 0) into their caches; P1 then writes X = 1, but a later cache hit in P2 still returns the stale value X = 0 – wrong data!]
Cache Coherence Problem
If each processor in a shared memory
multiple processor machine has a
data cache
Potential data consistency problem: the
cache coherence problem
Shared variable modification, private
cache
Objective: processes shouldn’t read
`stale’ data
Solutions
Cache Coherence Protocols
Write update – propagate cache line to
other processors on every write to a
processor
Write invalidate – a write invalidates copies of the line in other caches; a processor fetches the updated cache line when it next reads the invalidated data
Which is better?
Invalidation Based Cache
Coherence
[Figure: P1 and P2 both read X (initially 0); when P1 writes X = 1, P2's cached copy of X is invalidated, so P2's next read fetches the updated value X = 1]
Cache Coherence using invalidate
protocols
3 states associated with data items
Shared – a variable shared by 2 caches
Invalid – another processor (say P0) has updated the data
item
Dirty – state of the data item in P0
Implementations
Snoopy
for bus based architectures
shared bus interconnect where all cache controllers monitor all bus
activity
There is only one operation through bus at a time; cache controllers
can be built to take corrective action and enforce coherence in caches
Memory operations are propagated over the bus and snooped
Directory-based
Instead of broadcasting memory operations to all processors,
propagate coherence operations to relevant processors
A central directory maintains states of cache blocks, associated
processors
Implemented with presence bits
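To make the three states and the invalidate idea concrete, here is a hedged C sketch of a simplified state machine for a single cache line in one cache; the event names and transition rules are an illustrative MSI-style simplification, not a description of any particular hardware protocol.

#include <stdio.h>

/* States of one cache line in this cache (cf. Shared / Invalid / Dirty above). */
typedef enum { INVALID, SHARED, DIRTY } LineState;

/* Events seen by this cache's controller: local accesses and snooped bus traffic. */
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } Event;

static LineState next_state(LineState s, Event e) {
    switch (e) {
    case LOCAL_READ:  return (s == INVALID) ? SHARED : s;  /* miss fetches a shared copy          */
    case LOCAL_WRITE: return DIRTY;                         /* our write; other copies invalidated */
    case BUS_READ:    return (s == DIRTY) ? SHARED : s;     /* supply data, downgrade to shared    */
    case BUS_WRITE:   return INVALID;                       /* another processor wrote: invalidate */
    }
    return s;
}

int main(void) {
    const char *name[] = { "Invalid", "Shared", "Dirty" };
    Event trace[] = { LOCAL_READ, BUS_WRITE, LOCAL_READ, LOCAL_WRITE, BUS_READ };
    LineState s = INVALID;
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: line is %s\n", i, name[s]);
    }
    return 0;
}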
END