Oblivious Routing

Download Report

Transcript Oblivious Routing

Routing Algorithms
ECE 284
On-Chip Interconnection Networks
Spring 2014
1
Routing
• Will assume 2D mesh in this talk
• How flits are routed from source
to destination can greatly impact
network congestion
• Two types of routing:
• Oblivious routing: routing does not
consider or depend on the current
traffic condition
• Adaptive routing: takes into
consideration current traffic condition to determine the routing path
(tries to get around congested areas)
• Oblivious routing simpler (less expensive) to implement
• This talk will review existing oblivious routing algorithms
2
Routing Algorithm Objectives
• Maximize throughput
– How much load the network can handle
• Minimize latency
– Minimize routing delay between source and destination
3
Dimension-Ordered Routing (DOR)
(also called XY routing)
D
S
either minimal XY or YX routing to the destination
(here it uses XY route with probability 1.0)
4
DOR (XY) Routing with Uniform Traffic
• For an N = K x K mesh, N/2 nodes
are in the top half.
• 1/2 of its traffic will cross the
bisection.
• Traffic crossing bisection
uniformly distributed across K
channels.
• Therefore, maximum channel load
for DOR with uniform traffic is:
ϒ(DOR, uniform) = [ (N/2) * (1/2) ] / K = K/4
5
Problem with DOR (XY) Routing
• Minimal hop count
• But, in the worst-case, the links can get overly
congested. e.g., transpose traffic pattern.
(0, 0) -> (3, 3)
(1, 0) -> (3, 2)
(2, 0) -> (3, 1)
ϒwc(DOR) = K – 1 >> ϒ(DOR, uniform)
in the worst-case.
6
Valiant Load-Balancing (VAL) [1981]
D
S
randomly chosen
intermediate node
minimal XY routing to any intermediate node, then
minimal XY routing to destination node
7
Valiant Load-Balancing (VAL)
• Works by turning any traffic pattern into
2 phases of uniform traffic patterns, even
adversarial or worst-case traffic patterns.
• In effect, it evenly load-balances the traffic.
• Worst-case channel load
ϒwc(VAL) = 2 * ϒ(DOR, uniform) = 2 * (K/4) = K/2
which is 1/2 network capacity relative to DOR and uniform
traffic.
8
Valiant Load-Balancing (VAL)
• Effective network capacity normalized
throughput is 1/2 capacity.
• However, average hop count is 2X DOR.
• 1/2 capacity was thought to be the optimal
worst-case throughput for any routing
algorithm.
9
ROMM [1995]
D
S
intermediate node
randomly chosen
only in the minimal
direction to
destination
minimal XY routing to an intermediate node only in
the minimal direction, then minimal XY routing to
the final destination node
10
ROMM
• Tries to load-balance traffic by randomly
distributing traffic along all possible minimal
paths.
• Good that minimal number of hops is
guaranteed.
• But, turns out in the worst-case, ROMM
performs about as bad as DOR.
11
O1TURN [2005]
D
S
use both minimal XY and YX routing to the destination
(0.5 XY + 0.5 YX)
12
O1TURN
• Even though it only considers XY or YX path,
not all possible paths in VAL or all possible
minimal paths in ROMM, it is guaranteed to
achieve 1/2 capacity for the even radix case,
which has been shown to be optimal.
• For the odd radix case, O1TURN is very near
the optimal 1/2 capacity.
• Unlike VAL, O1TURN only uses minimal
routing paths, thus no penalty in hop count.
13
Comparison
Even Radix : Opt * 1
Odd Radix : Opt * (1 - 1 / K2)
0.5
0.4
0.3
0.2
0.1
VAL
DOR
ROMM
O1TURN
[adapted from Seo et al., O1TURN talk, ISCA 2005]
14
Simulation Results
Average Latency (cycle)
• 4 x 4 2D MESH – Uniform Random Traffic Pattern
200
DOR
ROMM
150
O1TURN
DUATO
100
50
0
0
0.2
0.4
0.6
0.8
1
Throughput (flits / node / cycle)
[adapted from Seo et al., O1TURN talk, ISCA 2005]
15
Simulation Results
• 4 x 4 2D MESH – Matrix Transpose Traffic Pattern
Average Latency (cycle)
– One of the worst-case traffic pattern for DOR
200
DOR
ROMM
150
O1TURN
DUATO
100
50
0
0
0.2
0.4
0.6
0.8
1
Throughput (flits / node / cycle)
[adapted from Seo et al., O1TURN talk, ISCA 2005]
16
Simulation Results
• 4 x 4 2D MESH – Bit Complement Traffic Pattern
Average Latency (cycle)
– Already balanced traffic pattern
200
DOR
ROMM
150
O1TURN
DUATO
100
50
0
0
0.2
0.4
0.6
0.8
1
Throughput (flits / node / cycle)
[adapted from Seo et al., O1TURN talk, ISCA 2005]
17
Simulation Results
• 4 x 4 2D MESH – HOT SPOT Traffic Pattern
– 2 nodes have 20% of traffic
Average Latency (cycle)
200
DOR
ROMM
150
O1TURN
DUATO
100
50
0
0
0.2
0.4
0.6
0.8
1
Throughput (flits / node / cycle)
[adapted from Seo et al., O1TURN talk, ISCA 2005]
18
Simulation Results
• Delay penalty of adaptive routing
– How the complexity of router implementation affects on latency
– Hot Spot Traffic Pattern
Average Latency (FO4)
2000
DOR
ROMM
1500
O1TURN
DUATO
1000
500
0
0
0.2
0.4
0.6
0.8
1
Throughput (flits / node / cycle)
[adapted from Seo et al., O1TURN talk, ISCA 2005]
19
U2TURN [2012]
• 1/2 capacity has been thought to be optimal
worst-case throughput for both odd and even
radices, and O1TURN is the state-of-the-art
for achieving this for the even radix case.
• But, turns out 1/2 capacity is not optimal for
the odd radix case.
20
U2TURN
• U2TURN considers all possible XYX and YXY 2-TURN paths, and
selects these paths with equal probability.
• XYX paths: randomly select a node on the same row and route
to it, followed by minimal YX routing to final destination.
• YXY paths: randomly select a node on the same column and
route to it, followed by minimal XY routing to final destination.
21
Analytical Results
• For the even radix case, worst-case capacity of
U2TURN = 1/2, same as VAL and O1TURN, which is
optimal.
• But, for the odd radix case, worst-case capacity of
U2TURN =
(K+1)/(2K+1) > 1/2
which is better than any existing routing algorithm.
22
Worst-Case Throughput
ROMM
VAL
DOR
U2TURN
O1TURN
Optimal routing
23
Throughput Comparison for Odd Radix
3X3 mesh
VAL DOR
O1TURN
U2TURN
Worst-case
0.5
0.33
0.44
Average-case
0.5
Transpose
5X5
VAL DOR
O1TURN U2TURN
0.57
0.5
0.3
0.48
0.55
0.405 0.477
0.604
0.5
0.44
0.53
0.632
0.5
0.33
0.67
0.8
0.5
0.3
0.6
0.75
Random
0.5
1
1
0.72
0.5
1
1
0.685
DOR-WC
0.5
0.33
0.67
0.8
0.5
0.3
0.6
0.75
Complement
0.5
0.67
0.67
0.57
0.5
0.6
0.6
0.55
Nearest-Neighbor
0.5
1.33
1.33
0.75
0.5
2.4
2.4
1.17
24
Throughput Comparison for Even Radix
4X4 mesh
VAL DOR
O1TURN
U2TURN
Worst-case
0.5
0.33
0.5
Average-case
0.5
0.48
Transpose
0.5
Random
6X6
VAL DOR
O1TURN U2TURN
0.5
0.5
0.3
0.5
0.5
0.54
0.64
0.5
0.47
0.556
0.65
0.33
0.67
0.8
0.5
0.3
0.6
0.75
0.5
1
1
0.7
0.5
1
1
0.682
DOR-WC
0.5
0.33
0.67
0.8
0.5
0.3
0.6
0.75
Complement
0.5
0.5
0.5
0.5
0.5
0.5
0.5
0.5
Nearest-Neighbor
0.5
2
2
1.1
0.5
3
3
1.27
25
Latency
• A potential concern about U2TURN is that it uses
non-minimal routing paths.
• However, U2TURN does a better job of loadbalancing traffic than O1TURN for difficult traffic
patterns.
• Hence, the queuing delay can be very high for
O1TURN for difficult traffic patterns, hence longer
latency despite fewer number of hops.
• Surprisingly, latency better for both odd and even
radix cases for difficult traffic patterns.
26
Latency for 7 x 7 Mesh
27
Latency for 8 x 8 Mesh
28
References
•
•
•
•
Valiant [L.G.Valiant et. al, ACM 1981]
ROMM [T.Nesson et. al, ACM 1995]
O1TURN [D. Seo et. al, ISCA 2005]
U2TURN [G. Sun et. Al, ICCD 2012]
29