Simulating a CRCW algorithm with an EREW algorithm


Communication operations
Efficient Parallel Algorithms
COMP308
Communication time
Communication time has three cost components:
1. Static start-up time (ts):
– The time required to handle a message at the
sending processor.
2. Per-hop time (th), where l is the number of links
the message traverses:
– After a message leaves a processor, it takes a finite
amount of time to reach the next processor on its path.
3. Per-word transfer time (tw), where m is the message
size in words:
– If the channel bandwidth is r words per second, then
each word takes time tw = 1/r to traverse the link.
There are two main routing schemes:
“store and forward” vs “cut-through”
In “store and forward” routing, when a
message traverses a path with multiple
links, each intermediate node on the path
forwards the message to the next node only after it
has received and stored the entire message.
In “cut-through” routing, an intermediate
node does not wait for the entire message to
arrive before forwarding it.

– A tracer is first sent from the source to the
destination node to establish a connection.
– Once a connection has been established, the flits
(the fixed-size units into which the message is split)
are sent one after the other. All flits follow the
same path in a dovetailed fashion.
– As soon as a flit is received at an intermediate
node, the flit is passed on to the next node.
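The slides do not state closed-form costs for the two schemes. The sketch below assumes the cost model commonly used in textbooks: store-and-forward pays the full message transfer on every link, while cut-through pays the per-hop time on every link but the transfer time only once. The function names are hypothetical; ts, th, tw, m and l are the parameters defined above.

```python
def store_and_forward_time(t_s, t_h, t_w, m, l):
    """Assumed model: each of the l links carries the whole m-word message in turn."""
    return t_s + (m * t_w + t_h) * l


def cut_through_time(t_s, t_h, t_w, m, l):
    """Assumed model: flits are pipelined, so m * t_w is paid once, not per link."""
    return t_s + l * t_h + m * t_w


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    params = dict(t_s=10, t_h=1, t_w=2, m=100, l=4)
    print("store and forward:", store_and_forward_time(**params))
    print("cut-through:      ", cut_through_time(**params))
```

Under this model the two schemes cost the same for a single link (l = 1), and the gap grows with the path length.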
One to All Broadcast

Initially, only the source processor has the
data of size m that needs to be broadcast. At
the end of the procedure,
there are p copies of the initial data, one
residing at each processor.
Broadcast on ring (Store and Forward)
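If the sender sends the message to the p-1 other processors
one after another, it takes p-1 steps.
By sending in both directions around the ring and letting every
processor that already holds the message forward it, we can
reduce this to p/2 steps.
E.g., an 8-processor ring requires 4 steps.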
NS diagram for “broadcast on ring”
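One way to see the p/2 figure is to simulate the procedure. The sketch below is an illustration under stated assumptions, not the schedule from the slide's diagram: it assumes a bidirectional ring in which each processor can send to at most one neighbour per step, and the function name is hypothetical.

```python
def ring_broadcast_schedule(p, source=0):
    """Return a list of steps; each step is a list of (sender, receiver) pairs.

    Assumption: every informed processor may send to one neighbour per step.
    """
    informed = {source}
    schedule = []
    while len(informed) < p:
        sends = []
        targeted = set()
        for node in sorted(informed):
            # Try the clockwise neighbour first, then the counter-clockwise one.
            for nbr in ((node + 1) % p, (node - 1) % p):
                if nbr not in informed and nbr not in targeted:
                    sends.append((node, nbr))
                    targeted.add(nbr)
                    break
        informed |= targeted
        schedule.append(sends)
    return schedule


if __name__ == "__main__":
    for step, sends in enumerate(ring_broadcast_schedule(8), start=1):
        print(f"step {step}: {sends}")
```

For p = 8 the simulation finishes in 4 steps: one processor is reached in the first step and two more in every later step.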
Ring network, Cut-Through routing

With cut-through routing, messages can be sent faster to
nodes that are multiple hops away in the network. We exploit
this by sending the message first to the most distant node.
In general, in a p-processor ring the source processor first
sends the data to the processor at distance p/2, then both
processors send the message to the processors at distance
p/4 in the same direction, then to p/8, etc.
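The halving pattern described above can be sketched as follows. This is a minimal illustration that assumes p is a power of two and that all sends go in the same (clockwise) direction; the function name is hypothetical.

```python
def ring_broadcast_cut_through(p, source=0):
    """Return the per-step (sender, receiver) pairs for a halving broadcast.

    Assumption: p is a power of two, so the distances p/2, p/4, ..., 1 are integers.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    informed = [source]
    schedule = []
    dist = p // 2
    while dist >= 1:
        # Every processor that holds the data sends it `dist` positions ahead.
        sends = [(node, (node + dist) % p) for node in informed]
        informed += [dst for _, dst in sends]
        schedule.append(sends)
        dist //= 2
    return schedule


if __name__ == "__main__":
    for step, sends in enumerate(ring_broadcast_cut_through(8), start=1):
        print(f"step {step}: {sends}")
```

The number of informed processors doubles in every step, so the broadcast takes log p steps (3 steps for p = 8).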
Broadcast on mesh (Store and Forward)
Most of the optimised
communication algorithms on
a mesh are simple extensions
of their ring counterparts,
obtained by applying the
ring algorithm along each
dimension of the mesh in turn.
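As an illustration of applying the ring algorithm one dimension at a time, the sketch below first broadcasts along the source's row and then along all columns in parallel. It assumes a mesh with wraparound links (so every row and column is a ring) and one send per processor per step; both function names are hypothetical.

```python
def ring_schedule(p, source=0):
    """Store-and-forward ring broadcast, one send per informed processor per step."""
    informed = {source}
    schedule = []
    while len(informed) < p:
        sends, targeted = [], set()
        for node in sorted(informed):
            for nbr in ((node + 1) % p, (node - 1) % p):
                if nbr not in informed and nbr not in targeted:
                    sends.append((node, nbr))
                    targeted.add(nbr)
                    break
        informed |= targeted
        schedule.append(sends)
    return schedule


def mesh_schedule(rows, cols, source=(0, 0)):
    """Phase 1: ring broadcast along the source row. Phase 2: along every column."""
    r0, c0 = source
    steps = [[((r0, a), (r0, b)) for a, b in step]
             for step in ring_schedule(cols, c0)]
    for step in ring_schedule(rows, r0):
        # All columns run the same ring step at the same time.
        steps.append([((a, c), (b, c)) for c in range(cols) for a, b in step])
    return steps


if __name__ == "__main__":
    for i, sends in enumerate(mesh_schedule(4, 4), start=1):
        print(f"step {i}: {len(sends)} sends")
```

On a 4 x 4 wraparound mesh this gives 2 steps for the row phase and 2 for the column phase.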
Hypercube

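The regular binary structure of the
hypercube plays an important role in
optimising communication.
Here, a broadcast is performed by sending
the message along one dimension in each
step. This results in log p = d steps, where d
is the dimension of the hypercube.
It is easy to prove that log p is the
minimal number of steps on any network in which
a processor can send to only one other processor per
step: the number of processors holding the message
can at most double in each step.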
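The dimension-by-dimension broadcast can be sketched as below. Node labels are taken to be integers whose binary digits are the hypercube coordinates, and the order in which dimensions are used (highest bit first) is an arbitrary choice for this illustration; the function name is hypothetical.

```python
def hypercube_broadcast(d, source=0):
    """Return d steps of (sender, receiver) pairs on a d-dimensional hypercube."""
    informed = [source]
    schedule = []
    for bit in range(d - 1, -1, -1):
        # Each informed node sends to its neighbour across dimension `bit`.
        sends = [(node, node ^ (1 << bit)) for node in informed]
        informed.extend(dst for _, dst in sends)
        schedule.append(sends)
    return schedule


if __name__ == "__main__":
    for step, sends in enumerate(hypercube_broadcast(3), start=1):
        print(f"step {step}:", [(f"{a:03b}", f"{b:03b}") for a, b in sends])
```

After step k exactly 2^k nodes hold the message, so all p = 2^d nodes are reached after d = log p steps.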
Hypercube

Important properties of the networks:
– Small degree,
– Small diameter,
– Regular recursive structure,
– Easy way to embed trees, etc

Hypercube – two nodes are connected if
their labels differ in exactly one bit
[Figure: hypercubes of dimension 1 to 4, with nodes labelled by binary strings (0, 1; 00 to 11; 000 to 111; 0000 to 1111)]
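The one-bit rule translates directly into bit operations. The sketch below assumes node labels are stored as integers; the function names are hypothetical.

```python
def hypercube_neighbours(node, d):
    """Neighbours of `node` in a d-dimensional hypercube: flip each of the d bits."""
    return [node ^ (1 << k) for k in range(d)]


def are_adjacent(a, b):
    """True if the labels a and b differ in exactly one bit."""
    diff = a ^ b
    return diff != 0 and diff & (diff - 1) == 0


if __name__ == "__main__":
    print([f"{n:04b}" for n in hypercube_neighbours(0b0101, 4)])
    print(are_adjacent(0b0101, 0b0111), are_adjacent(0b0101, 0b0110))
```

Every node therefore has exactly d neighbours, which is the small degree listed among the properties above.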
Broadcast on hypercube (S&F)
Broadcast on ring (Cut-Through )
Broadcast on mesh (C-T)
Broadcast on binary tree (C-T)
Gossiping
All-to-All Communication
Gossiping on Ring (Store and Forward)
Gossiping on Mesh (Store and Forward)
Gossiping on Hypercube (S&F)
Gossiping on Ring (and Mesh)
Cut-Through Routing



Each processor sends m(p-1) words of data because it has an m-word packet for every other processor.

The average distance that an m-word packet travels is
\[ \frac{\sum_{i=1}^{p-1} i}{p-1} = \frac{p}{2} \]
Since there are p processors, each performing the same type of
communication, the total traffic on the network is
\[ m(p-1) \times \frac{p}{2} \times p \]
The total number of communication channels in the network to
share this load is p, so the communication time is at least
\[ \frac{m(p-1) \times \frac{p^{2}}{2}}{p} \times t_w = m(p-1) \times \frac{p}{2} \times t_w \]
Hence this procedure cannot be improved by using CT routing
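The counting argument above can be checked numerically. The sketch below follows the same three steps (average distance, total traffic, load per channel); the function name is hypothetical, and distances are counted as 1 to p-1 as in the sum above.

```python
def ring_traffic_lower_bound(p, m, t_w):
    """Lower bound on communication time from total traffic shared by p channels."""
    avg_distance = sum(range(1, p)) / (p - 1)      # equals p / 2
    total_traffic = m * (p - 1) * avg_distance * p  # words x links, all processors
    return total_traffic / p * t_w                  # p channels share the load


if __name__ == "__main__":
    p, m, t_w = 8, 1, 1
    print(ring_traffic_lower_bound(p, m, t_w))  # matches m * (p - 1) * p / 2 * t_w
```

Because the bound depends only on how much data must cross how many links, it does not change with the routing scheme.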
Gossiping on Hypercube (CT routing)