Introduction to Network Design


Implications for Parallel Software and Synchronization
Message-passing comparison
Machine    | Year | Max BW (MB/s) | T0 (µs) | Cycles per T0 | MFLOPS per processor | FP ops per T0 | n_1/2 (bytes)
-----------|------|---------------|---------|---------------|----------------------|---------------|--------------
iPSC/2     | 1988 |   2.7         |   700   |   5,600       |     1                |     700       |   1,400
CM-5       | 1992 |   8           |    95   |   3,135       |    20                |   1,900       |     760
SP-1       | 1993 |  25           |    50   |   2,500       |   100                |   5,000       |   1,250
Paragon    | 1994 | 175           |    30   |   1,500       |    50                |   1,500       |   7,240
SP-2       | 1994 |  40           |    35   |   3,960       |   200                |   7,000       |   2,400
T3D-PVM    | 1994 |  27           |    21   |   3,150       |    94                |   1,974       |   1,502
NOW        | 1996 |  38           |    16   |   2,672       |   180                |   2,880       |   4,200
Challenge  | 1995 |  64           |    10   |     900       |   308                |   3,080       |     800
Sun E6000  | 1996 | 160           |    11   |   1,760       |   180                |   1,980       |   2,100
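These columns hang together through the usual linear cost model for an n-byte message: T(n) = T0 + n/B, with start-up time T0 and asymptotic bandwidth B. Under that model the half-power point n_1/2 = T0 × B is the message size at which half of peak bandwidth is delivered; this matches some rows exactly (CM-5: 95 µs × 8 MB/s = 760 bytes). A minimal sketch in C, using the CM-5 numbers; the model itself is an assumption layered on the table, not something the slides derive:

```c
#include <stdio.h>

/* Linear communication-cost model: T(n) = T0 + n / B.
   T0 in microseconds; B in MB/s, i.e. bytes per microsecond. */
int main(void) {
    double T0 = 95.0;            /* us   (CM-5 row of the table) */
    double B  = 8.0;             /* MB/s (CM-5 row of the table) */
    double n_half = T0 * B;      /* half-power message size, bytes */
    printf("n_1/2 = %.0f bytes\n", n_half);
    for (double n = 64; n <= 65536; n *= 4) {
        double t  = T0 + n / B;  /* one-way time, us */
        double bw = n / t;       /* delivered bandwidth, MB/s */
        printf("n=%7.0f B  time=%8.1f us  bw=%5.2f MB/s\n", n, t, bw);
    }
    return 0;
}
```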
Message-Passing Time vs Size

[Figure: one-way, point-to-point communication time (µs, log scale) versus message size (log scale) for iPSC/860, IBM SP-2, Meiko CS-2, Paragon/Sunmos*, Cray T3D, NOW, SGI Challenge, and Sun E5000.]
*The Sunmos operating system is used for the benchmark.
Message-Passing Bandwidth vs Size

[Figure: pairwise bandwidth (MB/s) versus message size (log scale) for iPSC/860, IBM SP-2, Meiko CS-2, Paragon/Sunmos, Cray T3D, NOW, SGI Challenge, and Sun E6000.]
Scalable Synchronization Operations

- Message-passing:
  - point-to-point synchronization
  - mutual exclusion
  - global synchronization
  - all built from point-to-point messages
- Shared address space:
  - mutual exclusion, barrier
  - point-to-point
- Recall: sophisticated locks reduced contention by spinning on separate locations
  - caching brought them local
  - test&test&set, ticket lock, array lock (a test&test&set sketch follows this slide)
    - O(p) space
- Problem: with an array lock, the location a process spins on is determined by arrival order, so it is not likely to be local
- Solution: queue lock
  - build a distributed linked list; each waiter spins on its own local node
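For reference, a minimal test&test&set sketch using C11 atomics; the type and function names are illustrative, not from the slides. Each spinner reads its locally cached copy of the flag and only attempts the atomic read-modify-write when the lock looks free:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool ttas_lock_t;   /* initialize to false */

void ttas_acquire(ttas_lock_t *L) {
    for (;;) {
        while (atomic_load(L))          /* test: spin in own cache */
            ;
        if (!atomic_exchange(L, true))  /* test&set: try to grab it */
            return;
    }
}

void ttas_release(ttas_lock_t *L) {
    atomic_store(L, false);
}
```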
Queue Locks

[Figure (a)-(e): lock L with processes A, B, C; each arriving process swaps itself onto the tail of the queue and chains in behind its predecessor, and each releaser signals the next waiter.]

- Head holds the lock; each node points to the next waiter
- Shared pointer to the tail
- Acquire:
  - swap (fetch&store) the tail with own node address; chain in behind the previous tail
- Release:
  - signal the next waiter
  - compare&swap plus a check to reset the tail when there is no successor
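A minimal sketch of such a queue lock (in the style of the MCS lock) using C11 atomics, following the acquire/release steps above; the names are illustrative. Each waiter spins only on its own node's flag, which lives in its local cache or memory:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;  /* points to the next waiter */
    atomic_bool locked;            /* true while this waiter must spin */
} qnode_t;

typedef struct {
    _Atomic(qnode_t *) tail;       /* shared pointer to the last waiter */
} qlock_t;

void qlock_acquire(qlock_t *L, qnode_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* swap (fetch&store) tail with our node's address */
    qnode_t *prev = atomic_exchange(&L->tail, me);
    if (prev != NULL) {                   /* lock held or contended */
        atomic_store(&prev->next, me);    /* chain in behind predecessor */
        while (atomic_load(&me->locked))  /* spin on our own local flag */
            ;
    }
}

void qlock_release(qlock_t *L, qnode_t *me) {
    qnode_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* no known successor: try to reset tail back to empty */
        qnode_t *expected = me;
        if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
            return;                       /* queue really was empty */
        /* a waiter is mid-enqueue: wait for it to chain in */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);   /* signal the next waiter */
}
```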
Barriers

- Special hardware
  - CM-5: a special "control" network
    - barriers, reductions, broadcasts, and other global ops
  - Cray T3D: hardware support for barriers
- Software algorithms
  - software combining trees
  - software combining barrier with sense reversal
  - barrier that spins on local variables only

[Figure: a flat combining structure concentrates contention at one location; a tree-structured combining barrier sees little contention.]
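A minimal sketch of a centralized sense-reversing barrier using C11 atomics (this is the flat, contended variant; a combining tree would replace the single counter with a tree of counters). The names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;        /* arrivals so far in this episode */
    int         nthreads;     /* total participants */
    atomic_bool global_sense; /* flips each time the barrier completes */
} barrier_t;

/* local_sense is per-thread state, initially false */
void barrier_wait(barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;   /* sense reversal: expect the flip */
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);                    /* last arriver resets */
        atomic_store(&b->global_sense, *local_sense);  /* release everyone */
    } else {
        while (atomic_load(&b->global_sense) != *local_sense)
            ;                        /* spin until the sense flips */
    }
}
```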
Parallel Prefix: Upward Sweep

[Figure: a 16-leaf combining tree over values 0 through F; each internal node is labeled with the index range it covers (1-0, 3-2, ..., 7-0, F-8, F-0) and stores the sum of its less-significant child.]

- generalization of barrier (reduce-broadcast)
- compute S_i = X_i + X_(i-1) + ... + X_0, for i = 0, 1, ...
- upward sweep: combine the children's values; store the least-significant child's sum (a sketch of both sweeps follows the next slide)
Downward Sweep of Parallel Prefix

[Figure: the same 16-leaf tree on the downward pass; each node now passes down the prefix of everything less significant than its subtree (3-0, 7-0, B-0, ...), so leaf i ends with S_i.]

- the least-significant branch sends its stored value to the most-significant child
- when a node receives a value from above:
  - send it on to the least-significant child
  - combine it with the stored value and send the result to the most-significant child
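A minimal sequential sketch of the two sweeps, assuming + as the combining operator and a power-of-two number of leaves; on a real machine each subtree would live on a different node and the array accesses would be messages, but the stored values and the traffic pattern are the ones shown in the figures:

```c
#include <stdio.h>

#define N 16              /* number of leaves; power of two for simplicity */

static int x[N];          /* leaf values X_0 .. X_{N-1} */
static int stored[2 * N]; /* per-node storage: sum of the node's
                             less-significant child's subtree */
static int result[N];     /* result[i] = X_0 + ... + X_i */

/* Upward sweep over node v covering leaves [lo, hi]: returns the
   sum of the whole range, storing the less-significant child's sum. */
static int up(int v, int lo, int hi) {
    if (lo == hi) return x[lo];
    int mid   = (lo + hi) / 2;
    int lower = up(2 * v,     lo,      mid);
    int upper = up(2 * v + 1, mid + 1, hi);
    stored[v] = lower;    /* combine children, keep least-significant sum */
    return lower + upper;
}

/* Downward sweep: from_above is the sum of all leaves less
   significant than this subtree (0 at the root). */
static void down(int v, int lo, int hi, int from_above) {
    if (lo == hi) { result[lo] = from_above + x[lo]; return; }
    int mid = (lo + hi) / 2;
    down(2 * v,     lo,      mid, from_above);             /* pass down   */
    down(2 * v + 1, mid + 1, hi,  from_above + stored[v]); /* combine, go */
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = 1;  /* all ones: S_i should be i+1 */
    up(1, 0, N - 1);
    down(1, 0, N - 1, 0);
    for (int i = 0; i < N; i++) printf("S_%d = %d\n", i, result[i]);
    return 0;
}
```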
Scalable Interconnection Network Design

Scalable, High-Performance Network

- At the core of parallel computer architecture
- Requirements and trade-offs at many levels
  - elegant mathematical structure
  - deep relationships to algorithm structure
  - managing many traffic flows
  - electrical / optical link properties
- Little consensus
  - interactions across levels
  - performance metrics? cost metrics? workload?
  - need holistic understanding

[Figure: a scalable interconnection network joining nodes, each containing a processor (P) and memory (M) attached through a communication assist (CA) and a network interface.]
Requirements from Above

- Communication-to-computation ratio
  => bandwidth that must be sustained for a given computational rate
  - traffic localized or dispersed?
  - bursty or uniform?
- Programming model
  - protocol
  - granularity of transfer
  - degree of overlap (slackness)

The job of a parallel machine network is to transfer information from source node to destination node in support of the network transactions that realize the programming model.
Goals

- latency as small as possible
- as many concurrent transfers as possible
  - operation bandwidth
  - data bandwidth
- cost as low as possible
Outline

- Introduction
- Basic concepts, definitions, performance perspective
- Organizational structure
- Topologies
Basic Definitions

- Network interface
- Links
  - a bundle of wires or fibers that carries a signal
- Switches
  - each connects a fixed number of input channels to a fixed number of output channels
Links and Channels

[Figure: a stream of digital symbols (...ABC123, ...QR67) flowing from the Transmitter across the link to the Receiver.]

- the transmitter converts a stream of digital symbols into a signal that is driven down the link
- the receiver converts it back
  - transmitter and receiver share a physical protocol
- transmitter + link + receiver form a Channel for digital information flow between switches
- the link-level protocol segments the stream of symbols into larger units: packets or messages (framing)
- the node-level protocol embeds commands for the destination communication assist within the packet
Formalism

- a network is a graph V = {switches and nodes} connected by communication channels C ⊆ V × V
- a channel has width w and signaling rate f = 1/τ (τ = cycle time)
  - channel bandwidth b = wf (e.g., w = 16 bits at f = 250 MHz gives b = 500 MB/s)
  - phit (physical unit): data transferred per cycle
  - flit: basic unit of flow control
  - the number of input (output) channels is the switch degree
- the sequence of switches and links followed by a message is a route
What characterizes a network?

- Topology
  - physical interconnection structure of the network graph
  - direct: node connected to every switch
  - indirect: nodes connected to a specific subset of switches
- Routing algorithm
  - restricts the set of paths that messages may follow
  - many algorithms with different properties
- Switching strategy
  - how the data in a message traverses its route
  - circuit switching vs. packet switching
- Flow control mechanism
  - when a message, or portions of it, traverse a route
  - what happens when traffic is encountered?
  - store-and-forward routing, wormhole routing (cut-through)
What determines performance?

- Interplay of all of these aspects of the design
Topological Properties

- Routing distance: number of links on a route
- Diameter: maximum routing distance
- Average distance
- A network is partitioned by a set of links if their removal disconnects the graph
Typical Packet Format

[Figure: a packet as a sequence of digital symbols: routing and control information (header), data payload, error code (trailer).]

A packet is a sequence of symbols transmitted over a channel.

- Two basic mechanisms for abstraction
  - encapsulation: carrying higher-level protocol information in an uninterpreted form within the message format
  - fragmentation: splitting the higher-level protocol information into a sequence of messages
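To make the layout concrete, a hypothetical packet structure in C; every field name and width here is an assumption for illustration, not a format from the slides:

```c
#include <stdint.h>

/* Illustrative packet layout: routing/control header, payload
   carried uninterpreted (encapsulation), error-check trailer. */
typedef struct {
    uint16_t route;        /* routing and control information */
    uint16_t dest_assist;  /* node-level command for the destination
                              communication assist */
    uint8_t  payload[64];  /* higher-level protocol data, opaque here */
    uint32_t crc;          /* error code computed over the packet */
} packet_t;
```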
Communication Performance: Latency

- Time(n)_(s-d) = overhead + routing delay + channel occupancy + contention delay
- occupancy = (n + n_e) / b
  - n_e: the size of the envelope (header and trailer)
- Routing delay?
- Contention?
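A small numeric sketch of the unloaded terms of this decomposition (contention omitted); all parameter values are illustrative assumptions:

```c
#include <stdio.h>

/* T(n) = overhead + routing delay + occupancy,
   occupancy = (n + n_e) / b, n_e = envelope size. */
int main(void) {
    double overhead = 1.0;   /* us of sender/receiver processing (assumed) */
    double delay    = 0.2;   /* us of routing delay along the path (assumed) */
    double b        = 100.0; /* MB/s, i.e. bytes per us (assumed) */
    double n_e      = 16.0;  /* envelope bytes (assumed) */
    for (double n = 16; n <= 4096; n *= 4) {
        double t = overhead + delay + (n + n_e) / b;
        printf("n=%5.0f B  T=%6.2f us\n", n, t);
    }
    return 0;
}
```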
Store&Forward vs Cut-Through Routing

[Figure: a four-phit packet (3 2 1 0) traveling from source to dest across several switches; under store & forward the entire packet is buffered at each hop before being forwarded, while under cut-through the phits pipeline through successive switches.]

- unloaded time over routing distance h, for message size n, bandwidth b, and per-hop routing delay Δ:
  - store & forward: h(n/b + Δ)
  - cut-through: n/b + hΔ
- wormhole vs. virtual cut-through: the two differ in where a blocked message waits (virtual cut-through buffers the whole packet at the blocking switch; wormhole leaves it strung out across the links)
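A small sketch contrasting the two formulas; the parameter values are illustrative. Note how store & forward pays the serialization time n/b at every hop, while cut-through pays it only once:

```c
#include <stdio.h>

/* Unloaded latency, no contention:
   n = message size (bytes), b = bandwidth (bytes/cycle),
   D = per-hop routing delay (cycles), h = routing distance (hops). */
static double store_and_forward(double n, double b, double D, double h) {
    return h * (n / b + D);   /* whole packet forwarded hop by hop */
}
static double cut_through(double n, double b, double D, double h) {
    return n / b + h * D;     /* header pipelines ahead of the payload */
}

int main(void) {
    double n = 1024, b = 4, D = 2;   /* illustrative numbers only */
    for (int h = 1; h <= 8; h *= 2)
        printf("h=%d  SF=%6.0f  CT=%6.0f cycles\n", h,
               store_and_forward(n, b, D, h), cut_through(n, b, D, h));
    return 0;
}
```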
Contention

[Figure: where blocked packets wait under store & forward, circuit switching, and cut-through (virtual channel, wormhole).]

- Two packets trying to use the same link at the same time
  - limited buffering
  - drop?
- Most parallel machine networks block in place
  - link-level flow control
  - tree saturation
- Closed system: offered load depends on delivered load
Bandwidth

- What affects local bandwidth (at individual nodes)?
  - packet density: b × n/(n + n_e)
  - routing delay: b × n/(n + n_e + wΔ)
  - contention
    - at the endpoints
    - within the network
- Aggregate bandwidth
  - bisection bandwidth
    - sum of the bandwidth of the smallest set of links that partition the network
  - total bandwidth of all the channels: C × b
  - suppose N hosts each issue a packet every M cycles with average routing distance h
    - each message occupies h channels for l = n/w cycles each
    - total load on the net: Nhl/M phits per cycle
    - C/N channels available per node
    - link utilization ρ = Nhl/(MC) < 1
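A small numeric sketch of the utilization bound; every value below is an illustrative assumption:

```c
#include <stdio.h>

/* rho = N*h*l / (M*C), with l = n/w: N hosts each issue a packet
   every M cycles, each packet crosses h links and holds each for
   l cycles; C channels exist in total.  Stable only if rho < 1. */
int main(void) {
    double N = 64, M = 200, h = 4;  /* hosts, issue interval, avg hops */
    double n = 128, w = 2;          /* 128-byte packets, 2-byte phits */
    double C = 256;                 /* total channels in the network */
    double l   = n / w;             /* cycles a packet holds one link */
    double rho = N * h * l / (M * C);
    printf("l = %.0f cycles, rho = %.2f%s\n", l, rho,
           rho < 1 ? " (below saturation)" : " (saturated!)");
    return 0;
}
```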
Saturation

[Figure: two plots. Left: latency versus delivered bandwidth, rising sharply at saturation. Right: delivered bandwidth versus offered bandwidth, flattening out once the network saturates.]