PPP – Lecture 1
Parallel Programming
Sathish S. Vadhiyar
Motivations of Parallel
Computing
Parallel Machine: a computer system with more than one
processor
Motivations
• Faster Execution time due to non-dependencies between
regions of code
• Presents a level of modularity
• Resource constraints. Large databases.
• Certain classes of algorithms lend themselves naturally to parallel implementation
• Aggregate bandwidth to memory/disk. Increase in data
throughput.
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
Parallel Programming and
Challenges
Recall the advantages and motivation of
parallelism
But parallel programs incur overheads
not seen in sequential programs
Communication delay
Idling
Synchronization
Challenges
[Figure: execution timeline for two processes P0 and P1, showing computation, communication, synchronization and idle time]
How do we evaluate a parallel
program?
Execution time, Tp
Speedup, S
S(p, n) = T(1, n) / T(p, n)
Usually, S(p, n) < p
Sometimes S(p, n) > p (superlinear speedup)
Efficiency, E
E(p, n) = S(p, n)/p
Usually, E(p, n) < 1
Sometimes, greater than 1
Scalability – how performance changes in relation to n and p; limitations in parallel computing
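As a small, hedged illustration of these metrics (the timing numbers below are made-up examples, not measurements of any real program), speedup and efficiency can be computed directly from measured execution times:

#include <stdio.h>

/* Speedup S(p,n) = T(1,n) / T(p,n); efficiency E(p,n) = S(p,n) / p. */
int main(void) {
    double t1 = 120.0;   /* example sequential time T(1, n), in seconds */
    double tp = 18.5;    /* example parallel time T(p, n), in seconds   */
    int    p  = 8;       /* number of processors used                   */

    double speedup    = t1 / tp;
    double efficiency = speedup / p;

    printf("S(%d, n) = %.2f, E(%d, n) = %.2f\n", p, speedup, p, efficiency);
    return 0;
}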
Speedups and efficiency
[Figure: speedup S and efficiency E versus number of processors p, ideal vs. practical curves]
Limitations on speedup –
Amdahl’s law
Amdahl's law states that the performance
improvement to be gained from using some
faster mode of execution is limited by the
fraction of the time the faster mode can be
used.
Overall speedup is expressed in terms of the fractions of computation time that can and cannot use the enhancement, and the speedup of the enhanced portion.
Places a limit on the speedup due to parallelism.
Speedup = 1 / (fs + fp/P), where fs is the serial fraction, fp = 1 - fs is the parallelizable fraction, and P is the number of processors
Amdahl’s law Illustration
S = 1 / (s + (1-s)/p)
[Figure: efficiency (0 to 1) versus number of processors (0 to 15) for the above formula, for several values of the serial fraction s]
Courtesy: http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html, http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
Amdahl’s law analysis
f       P=1    P=4    P=8    P=16    P=32
1.00    1.0    4.00   8.00   16.00   32.00
0.99    1.0    3.88   7.48   13.91   24.43
0.98    1.0    3.77   7.02   12.31   19.75
0.96    1.0    3.57   6.25   10.00   14.29
• For the same fraction f, the speedup falls increasingly short of the processor count as P grows.
• Thus Amdahl's law is a bit depressing for parallel programming.
• In practice, the parallel portion of the work has to be large enough to match a given number of processors.
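The table above can be reproduced from the formula itself; the following minimal C sketch evaluates S = 1 / ((1 - f) + f/P), where f is taken to be the parallel fraction, for the fractions and processor counts shown:

#include <stdio.h>

/* Amdahl's law with f as the parallel fraction: S = 1 / ((1 - f) + f / P). */
static double amdahl(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}

int main(void) {
    double fractions[] = { 1.00, 0.99, 0.98, 0.96 };
    int    procs[]     = { 1, 4, 8, 16, 32 };

    printf("%-6s", "f");
    for (int j = 0; j < 5; j++) printf("  P=%-5d", procs[j]);
    printf("\n");
    for (int i = 0; i < 4; i++) {
        printf("%-6.2f", fractions[i]);
        for (int j = 0; j < 5; j++) printf("  %-7.2f", amdahl(fractions[i], procs[j]));
        printf("\n");
    }
    return 0;
}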
Gustafson’s Law
Amdahl’s law – keeps the total work (problem size) fixed
Gustafson’s law – keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel vs. sequential work) to match that computation time
For a particular number of processors, find
the problem size for which parallel time is
equal to the constant time
For that problem size, find the sequential
time and the corresponding speedup
The resulting speedup is therefore called scaled speedup
Also called weak speedup (weak scaling)
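In symbols, a common statement of the scaled speedup (assuming s is the fraction of the time on the parallel machine spent in the sequential part) is:

S_{\text{scaled}}(P) = \frac{s + P\,(1 - s)}{s + (1 - s)} = s + P\,(1 - s) = P - (P - 1)\,s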
Metrics (Contd..)
Table 5.1: Efficiency as a function of n and p.
N      P=1    P=4    P=8    P=16    P=32
64     1.0    0.80   0.57   0.33
192    1.0    0.92   0.80   0.60
512    1.0    0.97   0.91   0.80
Scalability
Efficiency decreases with increasing P;
increases with increasing N
How effectively the parallel algorithm can use
an increasing number of processors
How the amount of computation performed
must scale with P to keep E constant
This function of computation in terms of P is
called isoefficiency function.
An algorithm with an isoefficiency function of
O(P) is highly scalable while an algorithm with
quadratic or exponential isoefficiency function
is poorly scalable
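One standard way to make this precise (following the formulation in the book by Grama et al., with W the problem size in units of serial work and T_o(W, p) the total overhead across all processors) is:

T_P = \frac{W + T_o(W, p)}{p}, \qquad
S = \frac{W}{T_P} = \frac{W\,p}{W + T_o(W, p)}, \qquad
E = \frac{S}{p} = \frac{1}{1 + T_o(W, p)/W}

\text{Holding } E \text{ constant gives } W = \frac{E}{1 - E}\, T_o(W, p) = K\, T_o(W, p), \text{ the isoefficiency function.}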
Parallel Program Models
Single Program
Multiple Data (SPMD)
Multiple Program
Multiple Data
(MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Programming Paradigms
Shared memory model – Threads, OpenMP,
CUDA
Message passing model – MPI
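As a minimal, hedged sketch of the shared memory model, the following OpenMP fragment in C lets all threads share the array a and combine their partial sums with a reduction; a message passing version of the same computation would instead distribute a across MPI processes and combine the partial sums with, for example, MPI_Reduce. The array size and contents are illustrative.

#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];     /* shared by all threads */
    double sum = 0.0;

    /* Each thread works on part of the iteration space; reduction(+:sum)
       combines the per-thread partial sums at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}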
PARALLELIZATION
Parallelizing a Program
Given a sequential program/algorithm, how to
go about producing a parallel version
Four steps in program parallelization
1. Decomposition
Identifying parallel tasks with large extent of
possible concurrent activity; splitting the problem
into tasks
2. Assignment
Grouping the tasks into processes with best load
balancing
3. Orchestration
Reducing synchronization and communication costs
4. Mapping
Mapping of processes to processors (if possible)
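A hedged C/MPI sketch tying the four steps to a concrete computation (a global sum over an array): the decomposition is per-element, the assignment is a contiguous block of elements per process, the orchestration is a single reduction, and the mapping of processes to processors is left to the MPI runtime or job scheduler. Sizes and values are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Decomposition: N independent element contributions.
       Assignment: a contiguous block of about N/nprocs elements per process. */
    const long N = 1000000;
    long chunk = N / nprocs;
    long lo = rank * chunk;
    long hi = (rank == nprocs - 1) ? N : lo + chunk;

    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += 0.5 * i;          /* illustrative per-element work */

    /* Orchestration: one collective combines the partial sums. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Mapping: which processor runs each rank is decided by the MPI runtime. */
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}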
Steps in Creating a Parallel
Program
[Figure: the sequential computation is partitioned into tasks (Decomposition), tasks are grouped into processes p0-p3 (Assignment), the processes are organized into a parallel program (Orchestration), and the processes are mapped onto processors P0-P3 (Mapping)]
Orchestration
Goals
Structuring communication
Synchronization
Challenges
Organizing data structures – packing
Small or large messages?
How to organize communication and synchronization?
Orchestration
Maximizing data locality
Minimizing volume of data exchange
Not communicating intermediate results – e.g. dot product
Minimizing frequency of interactions - packing
Minimizing contention and hot spots
Avoid having all processes use the same communication pattern with the same partner processes at the same time
Overlapping computations with interactions
Split computations into phases: those that depend on
communicated data (type 1) and those that do not (type
2)
Initiate the communication needed for type 1; while it is in progress, perform type 2 (see the MPI sketch after this list)
Replicating data or computations
Balancing the extra computation or storage cost against the gain due to less communication
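A hedged C/MPI sketch of the overlap idea above: the halo data needed by the type 1 (boundary) computation is requested with nonblocking calls, the independent type 2 (interior) work proceeds while the transfer is in flight, and the wait is issued only when the communicated data is actually needed. The buffer names, sizes and neighbor ranks are illustrative assumptions.

#include <mpi.h>

/* Overlap communication with independent computation (illustrative sketch). */
void exchange_and_compute(double *halo_in, double *halo_out, int n,
                          int left, int right, double *interior, int m)
{
    MPI_Request reqs[2];

    /* Start the exchange needed by the boundary (type 1) computation. */
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Type 2: interior work that does not depend on the incoming halo. */
    for (int i = 0; i < m; i++)
        interior[i] = 0.25 * (interior[i] + 3.0);

    /* Wait only when the communicated data is about to be used. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* Type 1: boundary work that reads halo_in would follow here. */
}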
Mapping
Which process runs on which particular
processor?
Can depend on network topology, communication
pattern of processors
On processor speeds in case of heterogeneous
systems
Mapping
Static mapping
Mapping based on data partitioning
Block distribution, e.g., 0 0 0 1 1 1 2 2 2
Block-cyclic distribution, e.g., 0 1 2 0 1 2 0 1 2
Applicable to dense matrix computations
Graph partitioning based mapping
Applicable to sparse matrix computations
Mapping based on task partitioning
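A hedged C sketch of the two data-partitioning schemes above: for element i of an n-element array distributed over p processes, block distribution assigns contiguous chunks while a cyclic distribution (block size 1) deals elements out round-robin; with n = 9 and p = 3 the output reproduces the 0 0 0 1 1 1 2 2 2 and 0 1 2 0 1 2 0 1 2 patterns shown above. The helper function names are illustrative.

#include <stdio.h>

/* Owner of element i under a block distribution of n elements over p
   processes (assumes p divides n, for simplicity). */
static int block_owner(int i, int n, int p)  { return i / (n / p); }

/* Owner of element i under a cyclic (block size 1) distribution. */
static int cyclic_owner(int i, int p)        { return i % p; }

int main(void) {
    int n = 9, p = 3;
    printf("block : ");
    for (int i = 0; i < n; i++) printf("%d ", block_owner(i, n, p));
    printf("\ncyclic: ");
    for (int i = 0; i < n; i++) printf("%d ", cyclic_owner(i, p));
    printf("\n");
    return 0;
}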
Based on Task Partitioning
Based on task dependency graph
[Figure: a binary tree task dependency graph with the tasks assigned to processes 0-7]
In general, the mapping problem is NP-complete
Mapping
Dynamic Mapping
A process/global memory can hold a set of
tasks
Distribute some tasks to all processes
Once a process completes its tasks, it asks
the coordinator process for more tasks
Referred to as self-scheduling, work-
stealing
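A hedged sketch of the self-scheduling idea in C with OpenMP: the task pool is just a shared counter, and each thread atomically claims the next unclaimed task whenever it finishes one; in a distributed memory setting the counter would instead live with a coordinator process that answers task requests. The task count and the do_task body are placeholders.

#include <stdio.h>
#include <omp.h>

#define NTASKS 100

static void do_task(int t) { (void)t; /* placeholder for real work */ }

int main(void) {
    int next = 0;    /* shared pool: index of the next unclaimed task */

    #pragma omp parallel
    {
        for (;;) {
            int mine;
            /* Atomically claim the next task (self-scheduling). */
            #pragma omp atomic capture
            mine = next++;

            if (mine >= NTASKS) break;
            do_task(mine);
        }
    }
    printf("all %d tasks completed\n", NTASKS);
    return 0;
}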
High-level Goals
Table 2.1: Steps in the Parallelization Process and Their Goals
Decomposition (mostly architecture-independent): expose enough concurrency, but not too much
Assignment (mostly architecture-independent): balance workload; reduce communication volume
Orchestration (architecture-dependent): reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping (architecture-dependent): put related processes on the same processor if necessary; exploit locality in network topology
PARALLEL ARCHITECTURE
Classification of Architectures –
Flynn’s classification
In terms of parallelism in
instruction and data stream
Single Instruction Single
Data (SISD): Serial
Computers
Single Instruction Multiple
Data (SIMD)
- Vector processors and
processor arrays
- Examples: CM-2, Cray C90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures –
Flynn’s classification
Multiple Instruction
Single Data (MISD): Not
popular
Multiple Instruction
Multiple Data (MIMD)
- Most popular
- IBM SP and most other
supercomputers,
clusters, computational
Grids etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures –
Based on Memory
Shared memory
2 types – UMA and
NUMA
[Figure: UMA and NUMA shared memory organizations]
Examples: HP Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Shared Memory vs Message
Passing
Shared memory machine: The n
processors share physical address space
Communication can be done through this
shared memory
[Figure: left, a shared memory machine – processors P connected through an interconnect to a shared main memory; right, a distributed memory machine – each processor P has its own memory M and the nodes are connected by an interconnect]
The alternative is sometimes referred to
as a message passing machine or a
distributed memory machine
Shared Memory Machines
The shared memory could itself be
distributed among the processor nodes
Each processor might have some portion of
the shared physical address space that is
physically close to it and therefore
accessible in less time
Terms: NUMA vs UMA architecture
Non-Uniform Memory Access
Uniform Memory Access
Classification of Architectures –
Based on Memory
Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Recently multi-cores
Yet another classification – MPPs,
NOW (Berkeley), COW,
Computational Grids
Parallel Architecture:
Interconnection Networks
An interconnection network is defined by switches, links and interfaces
Switches – provide mapping between input and output ports, buffering, routing etc.
Interfaces – connect the nodes to the network
Parallel Architecture:
Interconnections
Indirect interconnects: nodes are connected to
interconnection medium, not directly to each other
Shared bus, multiple bus, crossbar, MIN
Direct interconnects: nodes are connected directly
to each other
Topology: linear, ring, star, mesh, torus, hypercube
Routing techniques: how the route taken by the message
from source to destination is decided
Network topologies
Static – point-to-point communication links among
processing nodes
Dynamic – Communication links are formed dynamically by
switches
Interconnection Networks
Static
Bus – SGI challenge
Completely connected
Star
Linear array, Ring (1-D torus)
Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
k-d mesh: d dimensions with k nodes in each dimension
Hypercubes – a (log p)-dimensional mesh with 2 nodes per dimension – e.g. many MIMD machines
Trees – our campus network
Dynamic – Communication links are formed dynamically by
switches
Crossbar – Cray X series – non-blocking network
Multistage – SP2 – blocking network.
For more details, and evaluation of topologies, refer to book by
Grama et al.
Indirect Interconnects
[Figure: indirect interconnects – shared bus, multiple bus, 2x2 crossbar, crossbar switch, multistage interconnection network]
Direct Interconnect Topologies
[Figure: star, linear array, ring, 2-D mesh, torus, and hypercube (binary n-cube, n = 2, 3) topologies]
Evaluating Interconnection
topologies
Diameter – maximum distance between any two processing nodes
Completely-connected network – 1
Star – 2
Ring – p/2
Hypercube – log p
Connectivity – multiplicity of paths between two nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
Linear array – 1
Ring – 2
2-D mesh – 2
2-D mesh with wraparound – 4
d-dimensional hypercube – d
Evaluating Interconnection
topologies
Bisection width – minimum number of links to be removed from the network to partition it into two equal halves
Ring – 2
P-node 2-D mesh – sqrt(P)
Tree – 1
Star – 1
Completely connected – P^2/4
Hypercube – P/2
Evaluating Interconnection
topologies
channel width – number of bits that can be
simultaneously communicated over a link, i.e.
number of physical wires between 2 nodes
channel rate – performance of a single physical
wire
channel bandwidth – channel rate times channel
width
bisection bandwidth – minimum volume of communication allowed between the two halves of the network, i.e. bisection width times channel bandwidth
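As a small illustration of these definitions, here is a hedged C sketch that combines them for a d-dimensional hypercube (diameter d, bisection width p/2 for p = 2^d nodes); the channel width and channel rate are made-up example values.

#include <stdio.h>

int main(void) {
    int  d = 4;                 /* hypercube dimension (example)       */
    long p = 1L << d;           /* number of nodes, p = 2^d            */
    long diameter        = d;   /* maximum distance between two nodes  */
    long bisection_width = p / 2;

    double channel_width = 32;      /* bits per link (example value)      */
    double channel_rate  = 1e9;     /* bits per second per wire (example) */
    double channel_bw    = channel_width * channel_rate;   /* bits/s per link */
    double bisection_bw  = bisection_width * channel_bw;

    printf("p=%ld  diameter=%ld  bisection width=%ld  bisection bandwidth=%.3g bits/s\n",
           p, diameter, bisection_width, bisection_bw);
    return 0;
}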
Shared Memory Architecture:
Caches
[Figure: P1 and P2 both read X (initially 0) into their caches; P1 then writes X = 1, but a later cache hit in P2 still returns the stale value X = 0 – wrong data!]
Cache Coherence Problem
If each processor in a shared memory
multiple processor machine has a
data cache
Potential data consistency problem: the
cache coherence problem
Shared variable modification, private
cache
Objective: processes shouldn’t read
`stale’ data
Solutions
Cache Coherence Protocols
Write update – propagate cache line to
other processors on every write to a
processor
Write invalidate – a write invalidates copies of the line in other caches; a processor fetches the updated cache line when it next reads the invalidated data
Which is better?
Invalidation Based Cache
Coherence
[Figure: P1 and P2 both read X (initially 0); when P1 writes X = 1, P2's cached copy of X is invalidated, so P2's next read fetches the updated value X = 1]
Cache Coherence using invalidate
protocols
3 states associated with data items
Shared – a variable shared by 2 caches
Invalid – another processor (say P0) has updated the data
item
Dirty – state of the data item in P0
Implementations
Snoopy
for bus based architectures
shared bus interconnect where all cache controllers monitor all bus
activity
There is only one operation through bus at a time; cache controllers
can be built to take corrective action and enforce coherence in caches
Memory operations are propagated over the bus and snooped
Directory-based
Instead of broadcasting memory operations to all processors,
propagate coherence operations to relevant processors
A central directory maintains states of cache blocks, associated
processors
Implemented with presence bits
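To make the three states and the invalidate idea concrete, here is a hedged C sketch of a simplified state machine for a single cache line in one cache; the event names and transition rules are an illustrative MSI-style simplification, not a description of any particular hardware protocol.

#include <stdio.h>

/* States of one cache line in this cache (cf. Shared / Invalid / Dirty above). */
typedef enum { INVALID, SHARED, DIRTY } LineState;

/* Events seen by this cache's controller: local accesses and snooped bus traffic. */
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } Event;

static LineState next_state(LineState s, Event e) {
    switch (e) {
    case LOCAL_READ:  return (s == INVALID) ? SHARED : s;  /* miss fetches a shared copy          */
    case LOCAL_WRITE: return DIRTY;                         /* our write; other copies invalidated */
    case BUS_READ:    return (s == DIRTY) ? SHARED : s;     /* supply data, downgrade to shared    */
    case BUS_WRITE:   return INVALID;                       /* another processor wrote: invalidate */
    }
    return s;
}

int main(void) {
    const char *name[] = { "Invalid", "Shared", "Dirty" };
    Event trace[] = { LOCAL_READ, BUS_WRITE, LOCAL_READ, LOCAL_WRITE, BUS_READ };
    LineState s = INVALID;
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: line is %s\n", i, name[s]);
    }
    return 0;
}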
END