How to Train your Dragonfly
EE382C FINAL PRESENTATION
MAY 24, 2011
HYUNGMIN CHO
ANDREW DANOWITZ
MARIO FLAJSLIK
AMIMUL IHSAN
Outline
 Topology
 Routing
 Flow Control
 Hot Spot Management
 Status
Topology
 Dragonfly
 Minimizes expensive global communication
 No more than 4 hops
 Modification: each node connected to two routers
Source: Lecture 7 Notes
Topology
 Assumptions
 Per Node Traffic: 5GB/s/node @10%
 All intergroup connections in optical cables
 Routers can have up to 107 ports
 Resulting design
 a=26, p=27, h=4
Columns: a, p, h_r, h_g, Nodes per Group, Groups, Total Cost ($), Global Bandwidth (GB), Required Min Bandwidth (GB), Average Router Ports, Max Router Ports, Endpoint Ports, Router Ports, Router Ports for global connection
64 32 1 48 2048 49 127,478 240 1004 127 128 64 64 48
26 40 4 97 1024 98 478,485 485 507 105 106 80 26 97
26 30 4 97 1025 98 478,485 485 508 85 86 60 26 97
26 27 4 97 1026 98 478,485 485 508 79 80 54 26 97
26 13 4 97 1027 98 478,485 485 509 51 52 26 26 97
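For comparison, the textbook dragonfly sizing rules relate a, p, and h to router radix and maximum network size. The sketch below covers only the unmodified topology; this design connects each node to two routers and uses fewer groups, so its figures will differ. The function name and the returned keys are hypothetical.

```python
def standard_dragonfly(a: int, p: int, h: int) -> dict:
    """Sizing of an unmodified dragonfly (not this project's variant).

    a: routers per group, p: terminals per router, h: global channels per router.
    """
    radix = p + (a - 1) + h       # terminal + local + global ports per router
    max_groups = a * h + 1        # one global channel between every pair of groups
    max_nodes = a * p * max_groups
    return {"radix": radix, "max_groups": max_groups, "max_nodes": max_nodes}


if __name__ == "__main__":
    # Parameters of the chosen design point, applied to the textbook formulas.
    print(standard_dragonfly(a=26, p=27, h=4))
```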
Routing
[Figure: Groups 0, 1, and 2, each with Routers 0-2 on a local network, connected by global channels h0-h3 through the global network; labels mark the routing decision and the potential congestion.]
Figure modified from: Jiang, Dally, Kim: Indirect Adaptive Routing on Large Scale Interconnection Networks
Routing
 UGAL-L: globally adaptive routing that chooses between:
MIN: minimal path
VAL: non-minimal path, routing to a random intermediate group first (Valiant load balancing)
 Choice made based on local queue information:
 q_min × H_min compared to q_val × H_val
 Problems: limited throughput and higher intermediate latency
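The comparison above can be sketched in a few lines. This is a minimal illustration, assuming q is the local output-queue occupancy in flits and H is the hop count of each candidate path; the function name and the tie-breaking rule (prefer MIN on equality) are assumptions, not taken from the slides.

```python
def ugal_l_choose(q_min: int, h_min: int, q_val: int, h_val: int) -> str:
    """Pick "MIN" or "VAL" by comparing queue depth weighted by hop count.

    Ties go to the minimal path (an assumption for this sketch).
    """
    return "MIN" if q_min * h_min <= q_val * h_val else "VAL"


if __name__ == "__main__":
    # Lightly loaded minimal port: stay minimal (2*3 <= 4*5).
    print(ugal_l_choose(q_min=2, h_min=3, q_val=4, h_val=5))
    # Backed-up minimal port: divert through a random group (10*3 > 2*5).
    print(ugal_l_choose(q_min=10, h_min=3, q_val=2, h_val=5))
```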
Routing
 Problem: limited throughput due to imperfect load balancing of UGAL-L
UGAL-L will never route non-minimally through the same router that is used for minimal routing
 Solution: UGAL-L with selective virtual-channel discrimination
Figure modified from: John Kim, William J. Dally, Steve Scott, and Dennis Abts. 2008. Technology-Driven, Highly-Scalable Dragonfly Topology.
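The slide does not spell out the mechanism, so the following is one plausible reading, modeled on the hybrid VC variant described in the cited dragonfly paper: per-VC occupancy is compared only when the minimal and non-minimal candidates leave through the same output port, and whole-port occupancy is used otherwise. The data layout (a list of per-VC flit counts per port) and all names are hypothetical.

```python
def occupancies_for_decision(port_queues, min_port, min_vc, val_port, val_vc):
    """Return (q_min, q_val) to feed into the MIN/VAL comparison.

    port_queues[port][vc] holds the number of queued flits for that VC.
    """
    if min_port == val_port:
        # Same output port: total occupancy cannot tell the two paths apart,
        # so look at the individual virtual channels instead.
        return port_queues[min_port][min_vc], port_queues[val_port][val_vc]
    # Different output ports: compare total occupancy of each port.
    return sum(port_queues[min_port]), sum(port_queues[val_port])


if __name__ == "__main__":
    queues = {0: [8, 1, 0], 1: [2, 2, 2]}
    print(occupancies_for_decision(queues, 0, 0, 0, 1))  # shared port
    print(occupancies_for_decision(queues, 0, 0, 1, 1))  # distinct ports
```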
Routing
 Problem: high intermediate latency due to having to fill up buffers before sensing congestion
Buffers still need to be sized correctly to achieve maximum throughput
 Solution: using credit round-trip latency to sense and signal congestion
Figure from: John Kim, William J. Dally, Steve Scott, and Dennis Abts. 2008. Technology-Driven, Highly-Scalable Dragonfly Topology.
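A rough sketch of sensing congestion from credit round-trip time: record when each flit leaves, and when its credit comes back, compare the measured round trip against the zero-load value. Any excess suggests downstream queuing even before local buffers fill. The class name, the FIFO credit ordering, and the known zero-load round-trip constant are all assumptions made for illustration.

```python
from collections import deque


class CreditRoundTripSensor:
    """Tracks credit round-trip time for one output VC (illustrative only)."""

    def __init__(self, zero_load_rtt: int):
        self.zero_load_rtt = zero_load_rtt  # round trip with no congestion
        self.sent_times = deque()           # send time of each in-flight flit
        self.excess = 0                     # latest congestion estimate (cycles)

    def on_flit_sent(self, now: int) -> None:
        self.sent_times.append(now)

    def on_credit_returned(self, now: int) -> int:
        # Assumes credits return in the order the flits were sent.
        rtt = now - self.sent_times.popleft()
        self.excess = max(0, rtt - self.zero_load_rtt)
        return self.excess


if __name__ == "__main__":
    sensor = CreditRoundTripSensor(zero_load_rtt=10)
    sensor.on_flit_sent(now=0)
    sensor.on_flit_sent(now=1)
    print(sensor.on_credit_returned(now=10))  # no excess delay
    print(sensor.on_credit_returned(now=18))  # credit came back late
```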
Flow Control
 Basic virtual-channel flow control with credit-based backpressure
 Virtual Channel Flow Control
 6 VCs
 3 for standard traffic
 3 for hotspot traffic

 Exploring Packet Sizes
 Running simulations with different packet sizes
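The 3 + 3 VC split can be expressed as a static map from traffic class to VC indices. This is a minimal sketch: the specific index ranges (0-2 and 3-5) and the class names are assumptions, since the slides give only the counts.

```python
NUM_VCS = 6

# Hypothetical assignment: the slides state 3 VCs per class, not which ones.
VC_RANGES = {
    "standard": range(0, 3),
    "hotspot": range(3, 6),
}


def allowed_vcs(traffic_class: str) -> list:
    """Return the VC indices a packet of the given class may occupy."""
    return list(VC_RANGES[traffic_class])


if __name__ == "__main__":
    print(allowed_vcs("standard"))
    print(allowed_vcs("hotspot"))
```

Keeping the two sets disjoint is what prevents hotspot packets from blocking buffers needed by standard traffic.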
Hotspot Management
 Tree saturation problem
 Worse with more path diversity
 Non-interfering networks
 Separate VCs for hotspot and non-hotspot traffic
Figure taken from: EE382C Lecture 15 slides
 In the project, hotspot traffic is easily distinguished and hotspot nodes are assigned statically:
 Class separation
Hotspot Management
 Dynamic hotspot detection
 Still use class separation to manage hotspots
 Statically assigned (or slow changing) hotspot nodes
 Detect hotspots at last hop routers (by counting packets) and propagate information through the network
 Inspect queues for multiple packets going to the same destination, which is then likely to be a hotspot
 Fast changing hotspot nodes
 Assumption is that traffic to hotspot nodes is going to spike after node becomes hotspot
 Detect spikes by counting packets and looking for per destination peaks
 Use more virtual channels
 Impractical case of one VC per destination would solve the problem
 Use higher level QoS to do class separation
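Counting packets per destination, as suggested above, might look like the sketch below. It flags any destination that receives more than a chosen fraction of the packets seen in one observation window. The window contents, the 20% threshold, and the function name are assumptions; the slides do not give a detection rule.

```python
from collections import Counter


def detect_hotspots(destinations, threshold_fraction=0.2):
    """Return the set of destinations whose share of traffic exceeds the threshold.

    destinations: destination node IDs of packets seen in one window.
    """
    total = len(destinations)
    if total == 0:
        return set()
    counts = Counter(destinations)
    return {dest for dest, n in counts.items() if n / total > threshold_fraction}


if __name__ == "__main__":
    window = [1, 2, 3, 7, 7, 7, 7, 4, 5, 6]
    print(detect_hotspots(window))  # node 7 receives 40% of the packets
```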
Status
 Bugs squashed to date: 2
 Topology
 Routing
 Flow Control
Status: Topology
 In progress
 Changing:
 Router per group no longer 2a
 # Groups no longer a*p+1
 Each node connected to two routers
 Downsized network of 1,024 nodes
Status: Traffic Pattern
 4 kinds of traffic patterns to implement
 3 patterns complete
 bit-reversal traffic pending

Requires the number of nodes to be power of 2
 Iteration of 30 requests-replies
 TrafficManager class has been modified extensively
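The power-of-two requirement follows from how bit-reversal traffic is defined: each source sends to the node whose ID is the source ID with its bits in reverse order, which only yields a permutation when the node count is exactly 2^n. A minimal sketch, with a hypothetical function name:

```python
def bit_reversal_dest(src: int, num_nodes: int) -> int:
    """Destination of `src` under bit-reversal traffic over `num_nodes` nodes."""
    if num_nodes <= 0 or num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of 2")
    if not 0 <= src < num_nodes:
        raise ValueError("src out of range")
    bits = num_nodes.bit_length() - 1  # log2(num_nodes)
    dest = 0
    for _ in range(bits):
        dest = (dest << 1) | (src & 1)  # shift in the lowest remaining source bit
        src >>= 1
    return dest


if __name__ == "__main__":
    print(bit_reversal_dest(1, 1024))  # 0000000001 -> 1000000000
    print(bit_reversal_dest(3, 8))     # 011 -> 110
```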
Status: Routing
 UGAL-L algorithm
 Default function in Dragonfly.cpp
 Minimal routing okay on uniform traffic
 Working on UGAL-L_CR
 Credit mechanism needs to be changed
Status: Flow Control
 VC size: 256 flits
 Non-interfering networks
 Separate VC set for the hotspot traffic class
 3 VCs are dedicated for hotspot traffic

Exclusively for hotspot traffic
 Divide the messages into packets
 Started requests and replies at {10,10,10}
 Iterating size to: {20,20,20}, etc.
Questions