Transcript Slide 1 - Georgia Institute of Technology
HPI-DC'09
(in conjunction with CLUSTER'09)
Oblivious Routing Schemes in Extended Generalized Fat Tree Networks
New Orleans, 2009
Germán Rodríguez, Cyriel Minkenberg, Ramon Beivide, Ronald P. Luijten, Jesus Labarta, Mateo Valero
Summary
● We describe previously well-known regular modulo-based routing algorithms for k-ary n-trees.
● We extend and analyze these algorithms for a broader class of networks: XGFTs, including cost-effective variants of k-ary n-trees.
● We produce some combinatorial results that show that the two main variants of modulo-based algorithms perform equally well for a random distribution of traffic.
● We identify two intrinsic flaws of oblivious modulo-based algorithms and propose a variant that improves on both.
Outline
● XGFT topologies:
  ● k-ary n-trees and more cost-effective variants.
● Routing (State of the Art)
  ● Random
  ● Modulo-radix variants: Source-mod-k and Destination-mod-k
● Experimental environment
● Analysis of Modulo-radix algorithms
● Proposal
● Results – random NCA up/down
● Evaluation
● Conclusion
Extended Generalized Fat Trees I
[Figure: XGFT(3; 3,2,2; 2,2,3) with leaf labels 0,0,0 through 1,1,2, shown next to a 4-ary 2-tree]
Extended Generalized Fat Trees II
● Number of nodes at level i, 0 < i < h:
  $N_i = \prod_{j=i+1}^{h} m_j \cdot \prod_{j=1}^{i} w_j$
● Each node can be labeled as an h-tuple $\langle W_i, \ldots, W_1, M_h, \ldots, M_{i+1} \rangle$ (first the W's, then the M's), with $0 \le M_j < m_j$ and $0 \le W_j < w_j$, which in combination with the level number i uniquely determines a node in the whole network.
● Equivalent variations in the labeling schemes have been proposed [Lin04,Gomez07]
[Figure: node labels at every level of XGFT(3; 3,2,2; 2,2,3); XGFT(3;4,4,4;1,4,1) – Slimmed tree; 4-ary 2-tree]
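As a rough illustration of the node-count formula on this slide, the following Python sketch evaluates N_i for a given XGFT(h; m_1..m_h; w_1..w_h). The function name `nodes_per_level` and the list encoding of the parameters are assumptions made for this example; the sketch also evaluates the boundary levels i = 0 and i = h, where one of the two products is empty.

```python
from math import prod

def nodes_per_level(m, w):
    """Return [N_0, ..., N_h] for XGFT(h; m_1..m_h; w_1..w_h).

    Hypothetical encoding: m[j-1] holds m_j and w[j-1] holds w_j.
    N_i = prod_{j=i+1..h} m_j * prod_{j=1..i} w_j
    """
    h = len(m)
    assert len(w) == h, "m and w must both have h entries"
    return [prod(m[i:]) * prod(w[:i]) for i in range(h + 1)]

# XGFT(3; 3,2,2; 2,2,3), the example used on the slides
print(nodes_per_level([3, 2, 2], [2, 2, 3]))   # [12, 8, 8, 12]
# A 4-ary 2-tree written as XGFT(2; 4,4; 1,4)
print(nodes_per_level([4, 4], [1, 4]))         # [16, 4, 4]
```

The first result agrees with the 12 leaf labels (0,0,0 through 1,1,2) listed for XGFT(3; 3,2,2; 2,2,3).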
XGFTs and Contention
● XGFTs provide multiple paths for every pair of nodes:
  ● Proportional to the "number of parents" (w_i) parameters up to the level of the Least/Nearest Common Ancestors (NCAs) of source s and destination d.
  $\mathrm{Paths}(s,d) = \prod_{i=1}^{\mathrm{NCA\ level}(s,d)} w_i$
● Increasing the number of parents increases the cost.
  ● k-ary n-trees provide full bisection and set a well-known trade-off between cost and performance
  ● Slimmed trees (with w_i ≤ k) become more important with the increasing number of nodes
● Our analysis and proposal work better for slimmed trees than previous algorithms.
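The path-count formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: leaves are numbered consecutively so that two leaves share a level-i ancestor when they fall in the same block of m_1·...·m_i leaves, and the names `nca_level` and `num_paths` are hypothetical.

```python
from math import prod

def nca_level(s, d, m):
    """Level of the nearest common ancestors of leaves s and d.

    Assumed numbering: leaves are 0..prod(m)-1 and m[j-1] holds m_j.
    """
    level, block = 0, 1
    while s // block != d // block:
        block *= m[level]   # widen the block to the next level's subtree
        level += 1
    return level

def num_paths(s, d, m, w):
    """Paths(s, d) = product of w_i for i = 1..NCA level(s, d)."""
    return prod(w[:nca_level(s, d, m)])

# 4-ary 2-tree, XGFT(2; 4,4; 1,4)
print(num_paths(0, 1, [4, 4], [1, 4]))   # same switch: 1
print(num_paths(0, 5, [4, 4], [1, 4]))   # different switches: 4
# Slimmed tree XGFT(2; 16,16; 1,10)
print(num_paths(0, 255, [16, 16], [1, 10]))   # 10
```

Slimming the top level from 16 to 10 parents cuts the number of alternative paths between distant leaves in the same proportion.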
Related Work: Routing schemes
● Main oblivious routing schemes for Fat Trees
  ● Random [Valiant81][Greenberg85] selection of upward paths
  ● Either Source [Leiserson92][Ohrin95][Kariniemi06] modulo assignment of upward links
  ● or Destination [Lin04][Gomez07][Johnson08] modulo assignment of upward links
● Pattern-aware (used in this work)
  ● Colored Heuristic [Rodriguez09]
  ● We use it as a baseline for comparison
Random Routing I
● The assignment of links to reach an NCA is totally random
● Idea: a random distribution should equally distribute the probability of having contention
  ● At each step choose a random parent until an NCA is reached,
  ● Then, follow the unique deterministic path down
[Figure: random route from a source S; labels: Node 1, Node 10]
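The two steps above can be sketched as follows. This is a minimal illustration, not the simulator's code: `random_up_ports` is a hypothetical helper, and `w[i]` is assumed to hold the number of parents of a node at level i (0 = leaves).

```python
import random

def random_up_ports(nca_level, w, rng=random):
    """Pick one random up-port per level until the NCA level is reached.

    Only the upward choices are returned: the downward path is then
    fixed by the destination.
    """
    return [rng.randrange(w[i]) for i in range(nca_level)]

rng = random.Random(1)                  # fixed seed for reproducibility
ports = random_up_ports(2, [1, 4], rng)
print(ports)   # first entry is always 0 because w[0] == 1
```

Each call draws a fresh route, so two flows between the same pair of nodes may use different NCAs.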
Regular Routings (s mod k, d mod k)
● “Self-routing” approach
  ● At each step, choose the parent by doing a modulo operation (k)
  ● Difference: either the label of the source or the label of the destination is used, and only to go up the tree
[Figure: source mod k (left) and destination mod k (right) in a radix-3 tree, routing Node 10 = <1,0,1> to Dest 26 = <2,2,2>. Annotations: <1,0,1> mod 3 = (port) 1 and then (port) 0; <2,2,2> mod 3 = (port) 2 at each step.]
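A minimal sketch of the two modulo variants, assuming the common digit rule in which the switch at level l (0 = switches attached to the nodes) uses digit l of the base-k label. The function names are hypothetical; the rule is an assumption chosen because it reproduces the port annotations of the slide's example.

```python
def mod_k_up_ports(label, k, levels):
    """Up-port chosen at each switch level of a k-ary n-tree.

    Assumed rule: level l uses (label // k**l) % k.
    """
    return [(label // k**l) % k for l in range(levels)]

def s_mod_k(src, dst, k, levels):
    return mod_k_up_ports(src, k, levels)   # depends on the source only

def d_mod_k(src, dst, k, levels):
    return mod_k_up_ports(dst, k, levels)   # depends on the destination only

# Node 10 = <1,0,1>, Dest 26 = <2,2,2>, k = 3
print(s_mod_k(10, 26, 3, 2))   # [1, 0]
print(d_mod_k(10, 26, 3, 2))   # [2, 2]
```

Both variants are oblivious: the up-ports depend on one endpoint label only, never on the traffic pattern.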
Combinatorial Analysis of Modulo-based algorithms:
• An interesting question arises: is either of the two variants (source or destination) of the modulo-based algorithms intrinsically better?
• Number of permutations routed
  ● By s-mod-k, by d-mod-k
  ● The same; why?
  ● Idea: for every permutation P there exists Inverse(P) such that, if P has c conflicts with s-mod-k, Inverse(P) has c conflicts with d-mod-k (details in the paper)
• Number of general patterns (not permutations) routed
  ● By s-mod-k, by d-mod-k
  ● The same; why?
  ● Idea: decompose the pattern into all possible permutations
  ● Compute the maximum c over all possible permutations for s-mod-k
  ● Invert the decomposed permutations and apply the previous result: the union of the inverted permutations has the same maximum c for d-mod-k
  ● Look for more details in the paper
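The inverse-permutation argument can be checked numerically on a small case. The sketch below is an illustration under assumptions that are not in the slides: a two-level tree with k*k nodes, k leaf switches and k roots, and "conflicts" measured as the largest number of flows sharing one link. The helper `max_link_load` is hypothetical.

```python
import random
from collections import Counter

def max_link_load(flows, k, use_source):
    """Largest number of flows sharing one link in a two-level tree.

    Assumed model: node x sits on leaf switch x // k; the root is
    src % k for s-mod-k or dst % k for d-mod-k.
    """
    load = Counter()
    for src, dst in flows:
        if src // k == dst // k:
            continue                      # same switch: never goes up
        root = (src if use_source else dst) % k
        load[("up", src // k, root)] += 1
        load[("down", root, dst // k)] += 1
    return max(load.values(), default=0)

k = 4
rng = random.Random(0)
for _ in range(100):
    perm = list(range(k * k))
    rng.shuffle(perm)
    p = [(s, perm[s]) for s in range(k * k)]
    p_inv = [(d, s) for s, d in p]        # Inverse(P): swap the roles
    # s-mod-k on P and d-mod-k on Inverse(P) load the links identically
    assert max_link_load(p, k, True) == max_link_load(p_inv, k, False)
```

In this model the up-links used by P under s-mod-k mirror the down-links used by Inverse(P) under d-mod-k, and vice versa, which is why the two maxima always coincide.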
Experimental Setup
● Collection of application traces and pattern extraction
● Co-simulation approach [Minkenberg09]:
  ● Dimemas replays the MPI activity of the trace of an application
  ● Venus simulates the transmission of the messages with a detailed model of the network
[Figure: co-simulation toolchain with a detailed level of simulation. Labels: execution of an application (Applications/MPI), traces; Dimemas Simulator (ClientMod), config file: links, bandwidth, #buses, latency, eager/rendez-vous, etc.; Venus Simulator (ServerMod, traffic generator), config file: adapter and switch parameters, BW, link delay; Myrinet's map files (map2ned, topology, mapping) and Myrinet's route files (routereader, routes); statistics and traces; visualization, analysis and validation (Paraver).]
Applications
● WRF
  ● 256 processors
  ● Each process issues 2 outstanding sends to destinations +/- 16 nodes away (except the first and the last 16 processes)
● CG
  ● 128 processors
Results: WRF Progressive tree slimming
● Removing a single switch degrades the performance by a factor of 2
  ● Removing 7 more middle switches has no impact for 3 routing schemes
● Regular modulo routings work very well (as well as the baseline), while Random does not.
[Figure: WRF, progressive tree slimming. Series: Random, S mod k, D mod k, Colored, Full-Crossbar. X axis from 16 down to 1; y axis from 0 to 14.]
Modulo-based Algorithms look good
● A word about contention:
  ● Two main types: endpoint contention and network fabric contention
  ● Endpoint contention arises because a node is performing multiple outstanding sends or receives and has fewer adapters than it needs.
  ● Network fabric contention arises because there are not enough network resources or the routing algorithm is not using them adequately.
● Modulo-based routing algorithms work by using node labels to go up the tree, concentrating the endpoint contention of every particular node on a specific NCA
  ● S-mod-k uses the source label – endpoint contention at the source is concentrated
  ● D-mod-k uses the destination label – endpoint contention at the destination is concentrated
However, modulo-based algorithms do not always work well...
Results: CG
● Oblivious routings cannot achieve the best performance
● It's a pathological case for modulo-based oblivious algorithms
● Random routing does not achieve good performance
● The oblivious strategies do not match the baseline
[Figure: CG 128 processors, progressive tree slimming. Series: Random, S mod k, D mod k, Colored, Full-Crossbar. X axis from 16 down to 1; y axis from 0 to 4.]
Results: CG Communication Pattern
● Colored
  ● All phases take the same time
● Destination Mod K
  ● Non-local phase takes 8 times longer?
Results: CG Communication Pattern congruent with the modulo algorithm
● Why do oblivious algorithms work badly with CG?
● Only one phase in CG is non-local in our experiment:
  ● Each source sends to:
    ● destination = (source/2) * 16 + (source mod 2)
● Modulo-based routing algorithms in radix-16 networks
  ● OutputPort(destination) = ((source/2) * 16 + (source mod 2)) mod 16 == 0 or 1
  ● Map the 16 outgoing communications to either port 0 or 1
    ● 8 to each – 8 contending communications
    ● 14 unused ports in the switch …
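The arithmetic behind this pathological case is easy to reproduce. The sketch below assumes that "source/2" means integer division and looks only at the 16 sources attached to one switch; the helper names are hypothetical.

```python
from collections import Counter

RADIX = 16

def cg_destination(source):
    # Non-local CG phase as given on the slide (integer division assumed)
    return (source // 2) * 16 + (source % 2)

def d_mod_k_port(destination, k=RADIX):
    return destination % k

# The 16 sources attached to the first switch
ports = Counter(d_mod_k_port(cg_destination(s)) for s in range(16))
print(dict(ports))                            # {0: 8, 1: 8}
print("unused up-ports:", RADIX - len(ports)) # 14
```

Because (source/2) * 16 vanishes modulo 16, only the parity of the source survives, so every destination label maps to port 0 or port 1.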
Proposal: Random NCA up/down
Oblivious algorithms: What does d-mod-k or s-mod-k do?
They make certain “roots” responsible for routing a collection of sources or destinations.
The distribution of roots is even for a k-ary n-tree, but not for slimmed trees.
They try to concentrate endpoint contention either in the path up to the root (source mod k) or down from the root (destination mod k).
Idea: Each root is responsible for concentrating the endpoint contention of a number of leaf nodes.
An even distribution of leaf nodes to roots should lead to good performance.
We can relabel the nodes and apply modulo-based algorithms to the new source or destination labels, which defines two families of algorithms:
Random NCA up (using source labels)
Random NCA down (using destination labels)
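One way to picture the relabeling idea is the sketch below. It is an assumption-laden illustration, not necessarily the construction used in the paper: the leaves receive a random one-to-one relabeling, and the root is the new label modulo the number of roots. All function names are hypothetical.

```python
import random
from collections import Counter

def random_relabel(num_leaves, seed):
    """Random one-to-one relabeling of the leaves (assumed construction)."""
    labels = list(range(num_leaves))
    random.Random(seed).shuffle(labels)
    return labels

def r_nca_up_root(src, dst, labels, num_roots):
    return labels[src] % num_roots     # modulo on the new source label

def r_nca_down_root(src, dst, labels, num_roots):
    return labels[dst] % num_roots     # modulo on the new destination label

# 256 leaves spread over the 10 roots of a slimmed tree
labels = random_relabel(256, seed=1)
leaves_per_root = Counter(r_nca_up_root(s, None, labels, 10) for s in range(256))
print(sorted(leaves_per_root.values()))   # four roots get 25 leaves, six get 26
```

Since the new labels are a permutation of 0..255, the number of leaves per root differs by at most one whatever the seed, which is the even distribution the slide asks for.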
A word on the results plots
● In each of the graphs there is a data point for:
  ● Source-mod-k (triangle up, centered)
  ● Destination-mod-k (triangle down, centered)
● And three boxes with (minimum, 1st quartile, median, 3rd quartile and maximum) for:
  ● Random
  ● Random NCA up
  ● Random NCA down
● Note that although the results of the random algorithms are based on the statistical collection of 20 to 60 experiments with different seeds, the variance in the performance might not be noticeable, so a single horizontal line may be the whole “box”
[Figure: CG.D, progressive tree-slimming. Legend: s-mod-k (centered), d-mod-k (centered), colored (centered), r-NCA-u (1st box), r-NCA-d (2nd box), Random (3rd box). X axis: value of w2 (#middle switches) for XGFT(2;16,16;1,w2), from 16 down to 2; y axis from 0 to 4.]
Results: WRF
Random-NCA-up and Random-NCA-down are almost as good as S-mod-k and D-mod-k
[Figure: WRF, progressive tree-slimming. Legend: s-mod-k (centered), d-mod-k (centered), colored (centered), r-NCA-u (1st box), r-NCA-d (2nd box), Random (3rd box). X axis: value of w2 (#middle switches) for XGFT(2;16,16;1,w2), from 16 down to 1; y axis from 0 to 16.]
Results: CG
Random-NCA-up and Random-NCA-down are midway between S-mod-k and D-mod-k on one side and the baseline on the other.
[Figure: CG.D, progressive tree-slimming. Legend: s-mod-k (centered), d-mod-k (centered), colored (centered), r-NCA-u (1st box), r-NCA-d (2nd box), Random (3rd box). X axis: value of w2 (#middle switches) for XGFT(2;16,16;1,w2), from 16 down to 2; y axis from 0 to 4.]
Routes per NCA
● Distribution of routes per NCA for several routing schemes
  ● X axis is the NCA number
● Left – non-slimmed
  ● Small variance of routes per NCA per routing and across ports
● Right – slimmed topology
  ● Source and destination modulo-based algorithms show a huge difference of routes assigned per NCA
  ● Random and the proposed family of random assignment of NCAs exhibit less variance across NCAs
[Figure: Distribution of Routes per NCA, two panels: 16 NCAs (left) and 10 NCAs (right). Legend: s-mod-k (1st, data point), d-mod-k (2nd, data point), Random (3rd, box), r-NCA-u (4th, box), r-NCA-d (5th, box). X axis: NCA number; y axis from 3500 to 8000.]
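The imbalance in the slimmed case can be reproduced with a small count. The sketch assumes that s-mod-k in XGFT(2; 16,16; 1,w2) picks the root as (source mod 16) mod w2 and that only pairs on different leaf switches need a root; both the rule and the function name `routes_per_nca` are assumptions for illustration.

```python
from collections import Counter

def routes_per_nca(m1, num_switches, w2):
    """Routes per root for s-mod-k in XGFT(2; m1, num_switches; 1, w2).

    Assumed rule: root = (source % m1) % w2.
    """
    n = m1 * num_switches
    counts = Counter()
    for src in range(n):
        for dst in range(n):
            if src // m1 != dst // m1:    # skip pairs on the same switch
                counts[(src % m1) % w2] += 1
    return [counts[r] for r in range(w2)]

print(routes_per_nca(16, 16, 16))   # non-slimmed: 3840 routes on every root
print(routes_per_nca(16, 16, 10))   # slimmed: roots 0-5 get 7680, roots 6-9 get 3840
```

With 10 roots, the 16 possible digits fold unevenly (digits 10 to 15 wrap onto roots 0 to 5), so six roots carry twice as many routes as the other four.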
Conclusions
● There are no fundamental differences in performance for typical communication patterns between source and destination modulo-based algorithms
● Modulo-based algorithms present an intrinsic flaw for slimmed trees
  ● Non-balanced distribution of routes per NCA can lead to increased network contention
● A hybrid approach (randomly selecting NCAs that become “endpoint-contention” concentrators) helps and could be used as a better oblivious approach for both non-slimmed and slimmed networks.
HPIDC ’09
THANKS
Q & A
Routing in XGFTs
● Selecting a link upwards further limits the choice of links at the upper levels.
● In pink: the switches that can be visited after selecting the first leftmost parent of level 1 and the second leftmost link up of level
[Figure: reachable switches highlighted; labels: Node 1, Level 1, Level 2]
XGFTs I
● Superclass of Fat Tree topologies:
  ● XGFT(h; m_1, ..., m_h; w_1, ..., w_h)
  ● h is the height of the tree.
  ● m_i is the number of children per node at level i.
  ● w_i is the number of parents per node at level i.
[Figure: XGFT(1;4,1), XGFT(1;4,2), XGFT(1;4,3), XGFT(1;4,4); labels: 4-ary 1-tree, 4-ary tree]
Random Routing I
● The assignment of links to reach an NCA is totally random
● Idea: a random distribution should equally distribute the probability of having contention
● Drawback I: Suboptimal link assignment given a pattern
[Figure: two link assignments for sources S; labels: Nodes 1 - 9, Node 10]
Random Routing II
● Drawback II
  ● Even a single conflict halves performance
[Figure: two link assignments for 3 pairs of nodes among nodes 1 to 9. One uses 9 links with 2 conflicts; the other uses 6 links with no conflicts.]
Coupled effects
[Diagram labels: Topology, Routing, Communication Pattern, Mapping, Contention, Performance Results]