ENGR 9861 – High Performance Computer Architecture Winter 2007 Interconnection Networks (prepared by Jason Rhinelander in 2005) Winter 2007


ENGR 9861 – High Performance
Computer Architecture
Winter 2007
Interconnection Networks
(prepared by Jason Rhinelander in 2005)
Introduction



When considering interconnection networks
for parallel computation there are many
shared concepts with LANs (Local Area
Networks) and WANs (Wide Area Networks).
Interconnection networks for parallel
computing are a wide and interesting field.
There are many areas of theoretical and
practical research on this topic.
Reference: “Parallel Computer Architecture”, Ch. 10, Culler and
Singh.
Introduction

There are different ways of looking at this
topic:

Firstly, the interconnection structure is often
regular, with mathematical properties that reflect the
communication patterns of important algorithms.
Secondly, the design of the physical link between
asynchronous elements is a huge area of
research.
Thirdly, competition for shared resources
within a network is also a large area of research.
Basic Definitions Ch 10.1
Communication
Assist
Basic Definitions


Some terms:
CA: Communication Assist
NI: Network Interface
Mem: Memory
P: Processor
Communication requirements for this generic view:
The interconnection network must provide
network transactions that support the
programming model.
Latency should be minimized.
Adequate concurrent transactions must be
supported.
Basic Definitions



Physical Protocol: Converts analog
signals into digital ones.
Link Protocol: is responsible for
grouping symbols into packets.
Node Level Protocol: is responsible for
attaching information so that the target
CA can accomplish the transfer.
Basic Definitions


We can look at an interconnection network (IN) as a graph that
contains vertices (processing hosts or
switch elements) and channels between
vertices.
Channels have the following properties:

Width w (in bits)
Signaling rate f = 1 / T, where T is the cycle time
Basic Definitions



Channel Bandwidth b = w * f
The amount of data transferred across
a link in one cycle is called a physical
unit or phit.
Switches connect input channels to
output channels. The number of
channels connected is called the switch
degree.
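As a small illustration of the channel definitions above, the following sketch computes b = w * f from a channel width and cycle time. The function name and the example numbers are hypothetical and not taken from the slides.

```python
def channel_bandwidth(width_bits: int, cycle_time_s: float) -> float:
    """Return the channel bandwidth b = w * f in bits per second."""
    signaling_rate = 1.0 / cycle_time_s   # f = 1 / T
    return width_bits * signaling_rate    # b = w * f


# Hypothetical channel: 16 bits wide with a 2.5 ns cycle time.
# One phit is then 16 bits, the amount moved across the link per cycle.
b = channel_bandwidth(16, 2.5e-9)
print(f"{b / 1e9:.1f} Gbit/s")
```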
Network Components

The following components make up a
network:

Topology: The structure of the network
graph. 2D grid, 3D cube, irregular etc.


A direct network has a host (processing
element) connected to each switch.
An indirect network will have hosts connected
to a subset of available switches. The hosts will
then be on the edge of the network graph.
Network Components


Routing Algorithm: The path that a message takes
through the network is called a route. The procedure
describing which route each message takes is
called the routing algorithm.
Switching Strategy: How a message travels its
route.

Circuit Switching: An end-to-end connection is established and
the same route is used until the entire message is
transferred. The route can be reversed as well.
Packet Switching: The message is broken into packets, each with
its own routing information. Each packet can be
individually routed. Requires routing tag overhead.
Network Components

Flow Control Mechanism: Controls when a
message, or parts of it, moves along its
route. Flow control becomes a necessity
when a network resource has to be used
by multiple messages at the same time.

Flow Control Options:




Stalled in place.
Buffered.
Re-routed.
Discarded.
Network Components




The smallest unit of information that can be accepted or
rejected by the nodes in a network is called a flow control unit, or flit.
How big can a flit be?
ANS: It can be as big as the entire message or packet. It
can be as small as a phit.
Some other terms that can be used when talking of
networks for parallel processing are:



Diameter: the maximum length of the shortest path
between two nodes through a network.
Routing Distance: number of links between source
and destination.
Average Distance: average of routing distance.
Packet Formatting
Packet Format



Header: contains routing and control
info so that the switches can interpret
what to do when the packet arrives.
Payload: The information contained
within the packet.
Trailer: Usually contains an error
checking code.
Packet Format

In parallel processing networks, as in LANs,
WANs and the Internet, we have issues of
Encapsulation and Fragmentation.
Encapsulation involves carrying info from a
higher level of abstraction within the
current layer.
Fragmentation involves splitting up the
higher level information into a sequence of
messages.
Communication Performance


There are four components that affect
the time to transfer n bits from source
to destination:
TimeS-D(n) = Overhead + Routing Delay
+ Channel Occupancy + Contention
Delay
Performance


Overhead: comes from getting the message into and
out of the network (i.e., it can be caused by the CA).
Channel Occupancy: gives us a lower bound on
latency. Channel occupancy could be simply viewed
as the time taken for the message to get from source
to destination on a direct link.



The CA takes time to process the communication request.
Each channel traveled by the packet encounters delay.
The destination CA takes some time to process the packet.
Performance

Overall, the occupancy of the channel can
be determined by:
(n + ne) / b

n = number of bits in payload
ne = number of bits in header and trailer
b = channel bandwidth
We can also look at the effective
bandwidth as a fraction of b:
n / (n + ne)
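The two expressions above can be checked numerically with a short sketch. The function names and the example packet are assumptions made for illustration only.

```python
def channel_occupancy(n: int, ne: int, b: float) -> float:
    """Time the packet occupies a channel: (n + ne) / b."""
    return (n + ne) / b


def payload_fraction(n: int, ne: int) -> float:
    """Fraction of the channel bandwidth that carries payload: n / (n + ne)."""
    return n / (n + ne)


# Hypothetical packet: 1024 payload bits, 128 header/trailer bits,
# sent over a 1 Gbit/s channel.
print(channel_occupancy(1024, 128, 1e9))   # seconds on the channel
print(payload_fraction(1024, 128))         # share of bits that are payload
```

Larger payloads amortize the fixed header and trailer, so the fraction approaches 1 as n grows.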
Performance

Routing Delay: Each channel in the route
incurs a small delay that accumulates (we will consider
the time taken at the node-to-switch interface to be part of the
routing delay).

Causes of routing delay:



Routing distance (h): number of channels used in route.
Switching delay (Δ): time taken for a switch to select the
proper output port.
h depends on network topology, routing algorithm
used and specific nodes involved in the
transaction.
Unloaded Latency

(based on switching strategy)
Store and Forward Routing


(packet switched):
The entire packet is received by the switch before
it is forwarded to the next channel.
Latency: depends on the number of bits
in the packet, including header
and trailer.
Unloaded Latency

(based on switching strategy)
Circuit switched

Once the route is set up it is maintained.
We therefore only encounter the switching
latency when the route is set up.
Unloaded Latency

(based on switching strategy)
In the case of circuit switching we can
note the following:

As the message size increases, the latency
caused by route setup (h * Δ), and hence
the topology, becomes
insignificant.
How can we reduce the latency in the
case of store and forward packet
switching?
Unloaded Latency

(based on switching strategy)
Solution:

Fragment the message packet into smaller
packets. The smaller packets flow through in a
pipelined fashion. The unloaded latency then depends
on the size of the fragments, with a per-hop term
of the same form as before.
Unloaded Latency


(based on switching strategy)
The previous example is commonly used in
the Internet and larger networks.
In the case of parallel processing, cut-through
routing can be used:

Once a few phits are received by the switch, the
rest of the packet is routed to the output.
The per-hop delay
will be different
from the circuit
switched case.
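The latency expressions for the three switching strategies can be compared with a minimal sketch. The formulas below are an assumption: they follow the simple unloaded-latency model usually given for these strategies (as in the cited text), with n the packet size in bits including header and trailer, b the channel bandwidth, h the routing distance and Δ the per-hop switching delay. Function names and example values are hypothetical.

```python
def store_and_forward(n: float, b: float, h: int, delta: float) -> float:
    """Every hop receives the whole packet before forwarding it."""
    return h * (n / b + delta)


def circuit_switched(n: float, b: float, h: int, delta: float) -> float:
    """The switching delay is paid per hop at setup, then the data streams."""
    return n / b + h * delta


def cut_through(n: float, b: float, h: int, delta: float) -> float:
    """The head is forwarded once decoded and the body pipelines behind it.

    Same form as circuit switching; the delta value passed in may differ.
    """
    return n / b + h * delta


# Hypothetical example in cycles: 1000-bit packet, 100 bits/cycle,
# 4 hops, 2 cycles of switching delay per hop.
for name, f in [("store and forward", store_and_forward),
                ("circuit switched", circuit_switched),
                ("cut-through", cut_through)]:
    print(name, f(1000, 100, 4, 2))
```

Under this model the store and forward latency grows with the product of h and the packet time, while the other two add the hop delays to a single packet time.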
Unloaded Latency
(based on switching strategy)
Contention


As in traditional networking, contention will
occur when two incoming messages need to
be routed to the same output at the same
time.
In store and forward routing the switch will
buffer an entire packet. If there is contention,
one packet will get switched to the output
and one will get blocked until the next
switching cycle.
Contention


In circuit switching, usually some type of
probe is sent from the source node to the
destination. If there is contention, the probe
will be resent some time later.
Cut-through routing can handle contention in
two ways:

Virtual cut-through: route one of the packets into
a buffer, then route it in the next switch cycle. This
has the same penalty as store and forward routing
under contention.
Contention

Wormhole: only a few flits from the head of
the packet are buffered, and the tail
portion is held in place along the route. This is similar to holding
the circuit open from the sender’s point of
view.
Bandwidth

Looking at the bandwidth from the viewpoint
of a single node, the channel
bandwidth is higher than the bandwidth at which the
node can send useful data:
beff = b * n / (n + ne)
Bandwidth

Taking into account routing delay at the
switch (Δ) we have the following
expression:
beff = b * n / (n + ne + w * Δ)
w is included here in case
the channel width is more
than a bit. Note that n will
be the size in phits.
Bandwidth


These expressions are useful for looking at
the bandwidth available to one node.
What if we want a measure of the overall
bandwidth in a network?

Most common measure is the bisection
bandwidth:


The sum of the bandwidths of the minimum set of
channels that, if removed, partition the network into two
equal unconnected sets of nodes.
This has a nice property when considering a
uniform communication pattern. What is it?
Bandwidth

ANSWER:


Half of the messages are expected to cross the
bisection in each direction.
With this in mind what is wrong with this
notion of global or “aggregate” bandwidth
available?

ANSWER:

If communication is localized, then the bisection
bandwidth will give a pessimistic estimate of communication
time.
Total Bandwidth and Average Link Utilization



Total Bandwidth = C * b (bytes/sec)
= C * w (bits / cycle)
= C (phits / cycle)
Assume each of N hosts issues a packet
every M cycles with average routing distance
h. Then each packet occupies h channels for l
= n / w cycles.
The total load on the network is
(N * h * l) / M phits/cycle
Total Bandwidth and Average Link Utilization

The average link utilization ρ is the total load divided by
the total bandwidth, and must be less than 1:
ρ = (N * h * l) / (M * C) < 1
This is discussed on P. 762 of the text.
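A quick numerical check of the link utilization follows. It is a sketch that takes utilization as the total load (N * h * l) / M divided by the total bandwidth C in phits/cycle, using the quantities defined on the previous slide; the function name and the example network are hypothetical.

```python
def link_utilization(N: int, h: float, n: int, w: int, M: int, C: int) -> float:
    """Average link utilization: total load divided by total bandwidth."""
    l = n / w                 # cycles each packet occupies a channel
    load = N * h * l / M      # phits per cycle demanded by all hosts
    return load / C           # C channels carry C phits per cycle in total


# Hypothetical network: 64 hosts, average distance 4 hops, 128-bit packets
# on 16-bit channels, one packet per host every 100 cycles, 192 channels.
rho = link_utilization(N=64, h=4, n=128, w=16, M=100, C=192)
print(f"utilization = {rho:.3f}")
assert rho < 1, "offered load exceeds the total bandwidth"
```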
Bandwidth


The number of links or channels per
node (C / N) gives the communication
bandwidth available per node (phits/cycle/node).
This is consumed in direct proportion to
the message size and to the routing
distance.
Factors That Limit ρ


Before we look at why the link
utilization is less than one (in some
cases much less than one) let us
consider the various properties of the
network.
The number of links per network node
is a property of topology.
Factors That Limit ρ

Average routing distance depends on:





The topology
Routing algorithm
Program communication pattern
Mapping of program onto machine
Often good communication locality will
provide a small h, random communication will
give the average routing distance and a bad
pattern will cross the entire diameter.
Factors That Limit ρ

Factors:




Communication may not be balanced over all
links.
Even if it is balanced, the routing algorithm may
not support the communication pattern of the
program.
Contention for other networking resources may
arise.
These factors affect the saturation point of
the network.
What assumption is being
made here?
Topology of INs

Before we discuss some different types
of interconnection network topologies
we want to consider the following:


The number of host nodes
connected to the network is defined to
be N.
Characteristics of each topology will be
discussed as a function of N.
Fully Connected Network




This type of network connects all inputs to all
outputs. It can be considered a single big
switch.
The diameter of such a network is 1.
The degree is N.
Unfortunately, if there is a hardware failure in
such a network the entire network goes
down, or at least full connectivity is lost.
Fully Connected Network

A bus is an example of a fully
connected network.


Its cost scales with O(N)
Bandwidth:



Total Bandwidth = O(1)
Bisection Bandwidth = O(1)
Bandwidth scaling is worse than O(1), as the
clock rate drops with the number of ports due to RC delay
Fully Connected Networks



A crossbar switch is another example.
Bandwidth is O(N)
Cost is O(N^2), why?

As more inputs/outputs are added, the total
number of cross points grows as N^2.
The scalability of fully connected networks is
poor for large numbers of hosts. Usually only smaller
components of the network (like a basic
switching element) are fully connected.
Linear Arrays

Linear Array



Assume we have N (0 ..N-1) nodes
assembled in a linear fashion.
Assume each node is connected with a bidirectional link.
What is the diameter?



ANS: N – 1
Average routing distance ~ 2/3 N.
The bisection is one link.
Linear Arrays




The route from node A to node B can be
described by the operation B - A.
This result is termed the relative address.
It is a log N-bit number, with positive
values pointing away from node 0.
This arrangement provides no fault tolerance.
Ring Bi-directional Links






Easily constructed by connecting the ends of
a linear array together.
Degree: 2
The diameter is N/2
The bisection of the network is 2
The average routing distance is N/3
Note there are two relative addresses
because we can travel in either direction. Also
provides better fault tolerance.
Ring Unidirectional Links

If we have a ring that can only transmit
in one direction we have the following
properties:




Diameter N – 1
Average Distance is N/2
Relative Address ( B – A ) mod N
Bisection width: 1
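The relative addresses for linear arrays and rings can be sketched as follows. The function names are hypothetical, and the tie-breaking rule for the bidirectional ring (prefer the forward direction when both are equally long) is an assumption.

```python
def linear_relative(a: int, b: int) -> int:
    """Linear array: signed relative address B - A."""
    return b - a


def ring_unidirectional(a: int, b: int, n: int) -> int:
    """Unidirectional ring: hops from A to B, (B - A) mod N."""
    return (b - a) % n


def ring_bidirectional(a: int, b: int, n: int) -> int:
    """Bidirectional ring: the shorter of the two relative addresses.

    Positive means travel toward increasing node numbers, negative the
    other way. Ties go to the forward direction (an arbitrary choice).
    """
    forward = (b - a) % n
    backward = forward - n
    return forward if forward <= -backward else backward


N = 8
print(ring_unidirectional(6, 1, N))   # wraps around through node 7 and 0
print(ring_bidirectional(1, 6, N))    # shorter to go backward
# Worst-case distances match the diameters quoted above.
print(max(ring_unidirectional(0, b, N) for b in range(N)))       # N - 1
print(max(abs(ring_bidirectional(0, b, N)) for b in range(N)))   # N / 2
```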
Higher Dimension Meshes and Tori



A d-dimensional array consists of
N = k_{d-1} x k_{d-2} x … x k_0 nodes,
where k_j is the number of nodes along dimension j.
If 0 <= i_j <= k_j - 1 for 0 <= j <= d-1, we
can use a vector to locate any node in
the mesh, i.e., the coordinates of a
node are <i_{d-1}, i_{d-2}, …, i_0>
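To make the coordinate vector concrete, the sketch below converts between a node number and its coordinates. The slides do not say how nodes are numbered; the mixed-radix numbering used here (dimension 0 varies fastest) and the function names are assumptions.

```python
def node_to_coords(node: int, dims: tuple) -> tuple:
    """dims = (k_{d-1}, ..., k_0); returns (i_{d-1}, ..., i_0)."""
    coords = []
    for k_j in reversed(dims):      # peel off dimension 0 first
        coords.append(node % k_j)
        node //= k_j
    return tuple(reversed(coords))


def coords_to_node(coords: tuple, dims: tuple) -> int:
    """Inverse of node_to_coords; checks 0 <= i_j <= k_j - 1."""
    node = 0
    for i_j, k_j in zip(coords, dims):
        if not 0 <= i_j <= k_j - 1:
            raise ValueError("coordinate out of range")
        node = node * k_j + i_j
    return node


dims = (2, 3, 4)                     # a 2 x 3 x 4 array, N = 24 nodes
print(node_to_coords(17, dims))      # coordinates of node 17
print(coords_to_node((1, 1, 1), dims))
```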
Higher Dimension Meshes and Tori



Assuming the length along each dimension is
equal, N = k^d.
The degree of each node varies between d
and 2d. Nodes on the inside have the
maximum degree and nodes on the corners
have the smallest.
For example, for d = 3, nodes on the corners
have 3 links or channels, and nodes on the
inside have 6 channels. What about tori?
Higher Dimension Meshes and Tori




These arrays are called d-dimensional k-ary arrays.
To extend to a torus, the edges are
simply connected to the opposite side.
Usually these types of structures are
direct networks, meaning that every
node contains a processing element.
The network will scale by increasing k.
Higher Dimension Arrays and Tori

We can form a relative address by simply
performing vector subtraction (unidirectional
case):




R = (b_{d-1} - a_{d-1}, b_{d-2} - a_{d-2}, …, b_0 - a_0)
Actual routing can be performed in any order.
The diameter is simply d * (k - 1).
If k is even, the bisection of a d-dimensional
k-ary structure will be k^(d-1). If k is odd it
may be a little larger.
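The relative address, diameter and bisection for a d-dimensional k-ary mesh can be computed directly. This is a minimal sketch with hypothetical function names; the bisection function simply evaluates the even-k expression given above.

```python
def relative_address(a: tuple, b: tuple) -> tuple:
    """Component-wise B - A, ordered (d-1, ..., 0) like the coordinates."""
    return tuple(b_j - a_j for a_j, b_j in zip(a, b))


def mesh_diameter(d: int, k: int) -> int:
    """Corner-to-corner distance: k - 1 hops in each of d dimensions."""
    return d * (k - 1)


def mesh_bisection_links(d: int, k: int) -> int:
    """k^(d-1) links cross the middle of one dimension (k even)."""
    return k ** (d - 1)


r = relative_address((0, 1, 3), (2, 1, 0))
print(r)                              # signed hops per dimension
print(sum(abs(x) for x in r))         # routing distance for this pair
print(mesh_diameter(3, 4), mesh_bisection_links(3, 4))
```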
Higher Dimension Arrays and Tori



The average distance is the average
distance in each dimension.
Therefore average h = d * 2/3 * k
roughly.
Spatially these networks scale in size to
whatever dimension we have. Volume in
d=3 and planar space in d=2. Assuming
shortest possible wiring.
Higher Dimension Arrays and Tori
Higher Dimension Arrays and Tori
Trees



With meshes, the average routing distance grows
with the d-th root of N, N^(1/d).
A binary tree has a degree of 3 (three connections
per node).
Usually trees are used in indirect networks. Indirect
case:

Addressing to the leaves can be taken as a log_2 N bit vector.
This gives the path from the root to the host: 0 = left, 1 =
right.
The diameter is 2 * log_2 N.
Average routing distance is almost as large as the diameter
of the network.
Trees

Relative addressing can be
accomplished by doing the bit-wise XOR
operation.

For example to get the relative address
from A to B. R = A XOR B. The position of
the most significant 1 is how many levels
we go up. Then we use the lower bits of B
to get to B. We may not have to go all the
way to the root!
Trees
A = 0001
B = 0101
A XOR B = 0100
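The XOR rule can be sketched in a few lines, using the addresses from the example above. The function name and the returned representation (levels to climb, then a list of left/right bits) are hypothetical.

```python
def tree_route(a: int, b: int):
    """Route between leaves A and B of a binary tree (indirect network).

    Returns (levels_up, down_bits): climb levels_up levels, then descend
    following down_bits, where 0 = left and 1 = right.
    """
    r = a ^ b
    levels_up = r.bit_length()      # position of the most significant 1
    down_bits = [(b >> i) & 1 for i in range(levels_up - 1, -1, -1)]
    return levels_up, down_bits


up, down = tree_route(0b0001, 0b0101)
print(up, down)   # climb 3 levels (not all the way to the root), then descend
```

With 4-bit addresses the root is 4 levels above the leaves, so this route turns around one level below the root.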
Trees


We can have trees of higher order
called k-ary trees.
We can also have fat trees.


More bandwidth is assigned to more
important links as we go towards the root.
A big problem with trees is that the root
is composed of one link, therefore the
bisection is one link.
Butterflies



The construction of a butterfly is similar
to that of a tree.
We have many roots in a butterfly.
In addition, many parallel algorithms
communicate in a butterfly structure,
e.g., the Fast Fourier Transform and Batcher
odd-even merge sort.
Butterflies



As a building block we start with 2 x 2 switch
elements.
The basic building block is set up so that
addressing can occur: a bit of 0 causes a
straight edge to be followed, while a 1
causes a crossover.
In the case of a unidirectional indirect
butterfly with N hosts, the bisection is N/2
links.
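The straight/crossover rule can be illustrated with a simplified model. The sketch assumes the routing bits are the relative address (source XOR destination) and that stage 0 resolves the most significant bit; both are assumptions, as is the model of a packet's position as a row number whose bit is flipped on a crossover.

```python
def butterfly_route(src: int, dst: int, n_bits: int):
    """Return (decisions, final_row) for a butterfly with n_bits stages."""
    relative = src ^ dst
    row = src
    decisions = []
    for stage in range(n_bits):
        bit_pos = n_bits - 1 - stage          # assumed: MSB handled first
        if (relative >> bit_pos) & 1:
            row ^= 1 << bit_pos               # 1: take the crossover edge
            decisions.append("cross")
        else:
            decisions.append("straight")      # 0: follow the straight edge
    return decisions, row


decisions, row = butterfly_route(0b010, 0b111, 3)
print(decisions, row)   # the packet ends on the destination row
```

Every packet crosses exactly n_bits stages regardless of source and destination, which matches the log_2 N hop count of the butterfly.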
Butterflies
Butterflies



When considering scalability, butterflies
can be better than meshes and trees in
that there are a total of N log_2 N links (in the
case of the previous figure), with
packets crossing log_2 N links on average.
Therefore, on average, there shouldn’t
be any collisions.
How many links are in the bisection?
Butterflies
Hypercubes




If we take the original butterfly and collapse
each straight column into a single log_2 N
switch, the result is close to a hypercube
arrangement.
We can cross dimensions of the hypercube to
get from source to destination.
Each node of a hypercube can embed a lower
dimension mesh.
Text P778.
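Crossing dimensions to reach the destination can be sketched as follows. Correcting the differing address bits from the lowest dimension upward is one possible order chosen here for illustration; the function name is hypothetical.

```python
def hypercube_route(src: int, dst: int, dims: int) -> list:
    """Return the list of nodes visited from src to dst.

    Each hop crosses one dimension in which the current node and the
    destination differ, lowest dimension first (an arbitrary choice).
    """
    path = [src]
    node = src
    for d in range(dims):
        if ((node ^ dst) >> d) & 1:
            node ^= 1 << d            # cross dimension d
            path.append(node)
    return path


print(hypercube_route(0b000, 0b101, 3))
# The hop count equals the number of bits in which src and dst differ.
```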
Hypercubes
From: http://linux.cs.sonoma.edu/~ravi/ces516sp04/Lectures/feb18.ppt
Some Example Architectures
Routing


Routing from source to destination is of
primary importance to parallel
computing.
We have already seen some examples
of how a relative address is formed. In
the case of a d=3 cube the relative
address will give the shortest path in all
three dimensions.
Routing


At each switch element, the routing
algorithm decides which output port to
place the packet onto.
There are 3 ways to determine the output port based
on the packet header:



Arithmetic
Source based port select
Table Lookup
Routing

Arithmetic

2D Mesh:


Each relative address contains the length to be
traveled in both the x and y directions [Δx, Δy]
At switch i,j we perform the following routing:
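A per-switch decision of this kind can be sketched as below, assuming dimension order routing with x corrected before y. The port names ("+x", "-x", "+y", "-y", "host") and function names are hypothetical and not taken from the slides.

```python
def route_step(dx: int, dy: int):
    """One switch decision from the header [dx, dy].

    Returns (output_port, new_dx, new_dy). The distance in the chosen
    direction is moved one step toward zero before forwarding.
    """
    if dx > 0:
        return "+x", dx - 1, dy
    if dx < 0:
        return "-x", dx + 1, dy
    if dy > 0:
        return "+y", dx, dy - 1
    if dy < 0:
        return "-y", dx, dy + 1
    return "host", 0, 0               # both distances are zero: deliver


def route(dx: int, dy: int) -> list:
    """Ports taken at successive switches until delivery."""
    ports = []
    while True:
        port, dx, dy = route_step(dx, dy)
        ports.append(port)
        if port == "host":
            return ports


print(route(2, -1))
```

Because the decision only compares header fields with zero, each switch needs no knowledge of its own coordinates.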
Routing


The switch will look at the routing info in the
packet and modify the distance in the
appropriate direction. Dimension order
routing takes each dimension in turn.
Source based routing can also be used in
which the source node assigns switch port
numbers to the header.


Simple from the switch side.
The header may have a variable size and may be large.
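Source based routing can be sketched as a header that lists one output port per hop. Consuming the first entry at each switch is an assumed encoding chosen for illustration, and the function name is hypothetical.

```python
def switch_forward(header_ports: list):
    """One switch: use the first port number and strip it from the header."""
    return header_ports[0], header_ports[1:]


# Hypothetical header built by the source for a 3-hop route.
header = [2, 0, 3]
while header:
    port, header = switch_forward(header)
    print("forward on port", port, "remaining header", header)
```

The switch logic is trivial, but the header length grows with the routing distance, which is the variable-size cost noted above.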
Routing

Table Driven Routing:


Similar to the Internet and WANs.
Switches will have a table of information
used for routing. The header contains an
index that is used in the table to select the
proper output port.



Tables must be updated.
Switch specific messages.
The table must be established in the first place.
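A table driven switch reduces to a lookup plus a way to load and update entries. The class below is a minimal sketch; its name, methods and port labels are hypothetical.

```python
class TableSwitch:
    """Switch whose output port is chosen by a table indexed by the header."""

    def __init__(self, table: dict):
        self.table = dict(table)          # the table is established up front

    def update(self, index: int, port: str) -> None:
        """Change one entry, e.g. via a switch specific message."""
        self.table[index] = port

    def output_port(self, header_index: int) -> str:
        return self.table[header_index]


switch = TableSwitch({0: "east", 1: "north"})
print(switch.output_port(1))
switch.update(1, "west")                  # re-route traffic for index 1
print(switch.output_port(1))
```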
Routing

Deterministic Routing

The route is determined solely by the source
and destination. The status of the network
is not considered.



Dimension ordered routing is such an example.
In the case of a 2D mesh how else can we
route a packet?
ANS: If one dimension was blocked, we
could switch to the other dimension etc…
Routing

Adaptive:


The route of the packet is determined by
source and destination, but may be
influenced by network conditions.
In the previous case we could zig-zag
across the mesh if both dimensions on the
edges were congested.
Deadlock



Deadlock: occurs when a packet waits
for an event that cannot occur.
Livelock: occurs when a packet keeps being
routed but never arrives at its destination.
Indefinite Postponement: occurs when
a packet waits for an event that can occur
but never does.
Deadlock Example
Virtual Channels



One way of avoiding such deadlock is to
implement virtual channels.
Virtual channels are used in wormhole routing
and give each physical channel
multiple buffers.
Assume we have 2 virtual channels in the
previous example. Say packets at a node
numbered higher than their destination are placed in the
high channel, and packets at a node numbered lower
than their destination are placed in the low channel.
Virtual Channels
Adaptive Routing
Other Topics In Interconnection
Networks




Turn-Model Routing
Switch Design
Channel Buffers
Flow Control


There are differences between LANs and
interconnection networks for parallel
processing.
Global Communications.
SGI Origin Network





Named SPIDER (we’ll see why when we look
at its structure); it supports 1.56 GB/s total
bandwidth in both directions.
Each switch contains 6 pairs of unidirectional
links.
Two nodes are connected to each switch
leaving 4 links to connect to other switches.
Routing is table based. So that
Message priority is supported.
SGI Origin Network