Distributed Crossbar Schedulers

Cyriel Minkenberg¹, Francois Abel¹, Enrico Schiattarella²
¹ IBM Research, Zurich Research Laboratory
² Dipartimento di Elettronica, Politecnico di Torino
HPSR 2006
Outline
• OSMOSIS overview
• Challenges in the OSMOSIS scheduler design
• Basics of crossbar scheduling
• Distributed scheduler
  – Architecture
  – Problems
  – Solutions
• Results
• Implementation
OSMOSIS Overview
[Figure: All-optical switch datapath and control path. 64 ingress adapters with VOQs feed 8 broadcast units (Tx, 8x1 combiner, WDM mux, optical amplifier, 8x1 star coupler); 128 select units (fast SOA 1x8 fiber selector gates and fast SOA 1x8 wavelength selector gates) feed 64 egress adapters (2 Rx each). Control flow over 64 control links: (1) packet waits in VOQ, (2) request to central scheduler, (3) central scheduler computes a bipartite graph matching (BGM), (4a) grant to ingress adapter, (4b) switch command to the SOA gates, (5) all-optical packet transfer.]
• 64 ports @ 40 Gb/s, 256-byte cells ⇒ 51.2 ns time slot (worked out below)
• Broadcast-and-select architecture (crossbar)
• Combination of wavelength- and space-division multiplexing
• Fast switching based on SOAs
• Electronic input and output adapters, electronic arbitration
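As a quick check, the 51.2 ns time slot follows directly from the cell size and line rate:

    t_slot = (256 bytes × 8 bits/byte) / (40 Gb/s) = 2048 bits / (40 × 10⁹ bit/s) = 51.2 ns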
Architectural Scheduler Challenges
• Latency < 1 µs
  – Pr: Long permission latency (RTT + scheduling)
  – So: Speculation
• Multicast support
  – Pr: Fair integration with unicast scheduling, control channel overhead
  – So: Independent schedulers with filter, merge & feedback scheme
• Scheduling rate = cell rate
  – Pr: Produce one high-quality matching every 51.2 ns
  – So: Deeply pipelined matching with parallel sub-schedulers (FLPPR)
• FPGA-only scheduler implementation
  – Pr: Does a 64-port scheduler fit in one FPGA device? If not, how do we distribute it over multiple devices while maintaining an acceptable level of performance?
Crossbar Scheduling: Bipartite Graph Matching
[Figure: a request matrix between inputs and outputs, and three example matchings on a 4x4 bipartite graph — a maximal matching of size 2, a maximal matching of size 3, and a maximum matching of size 4.]
• A crossbar is a non-blocking fabric that can transfer cells from any input to any output with the following constraints:
  – At most one cell from any input
  – At most one cell to any output
• Equivalent to Bipartite Graph Matching (BGM) — a small example follows below
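To make the maximal-versus-maximum distinction concrete, here is a small Python example (the request matrix and helper are illustrative, not from the OSMOSIS design): a single greedy pass yields a maximal matching, which may be smaller than the maximum one.

    # R[i][j] = 1 if input i has a cell for output j
    R = [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1],
    ]

    def greedy_maximal_matching(R):
        """Pair each input with the first free output it requests.
        The result is maximal (no edge can be added) but not
        necessarily maximum."""
        n = len(R)
        out_taken = [False] * n
        match = {}                      # input -> output
        for i in range(n):
            for j in range(n):
                if R[i][j] and not out_taken[j]:
                    out_taken[j] = True
                    match[i] = j        # at most one cell per input
                    break
        return match

    print(greedy_maximal_matching(R))   # {0: 0, 2: 1, 3: 2} -- size 3
    # The maximum matching has size 4: 0->1, 1->0, 2->2, 3->3.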
Pointer-based Parallel Iterative Matching
• One matching must be computed in every time slot, so we need fast and simple algorithms
• A suitable class of algorithms is parallel, iterative, and based on round-robin pointers
  – i-SLIP (McKeown), DRRM (Chao)
• These algorithms have a number of desirable features:
  – 100% throughput under uniform i.i.d. traffic
  – Starvation-free: any VOQ is served within finite time under any traffic pattern
  – Iterative: sequential improvement of the matching by repeating steps
  – Amenable to fast hardware implementation; high degree of parallelism and symmetry
DRRM Operation
• Step 0: Initially, all inputs and outputs are unmatched
• Step 1: Each unmatched input requests the first unmatched output in round-robin order for which it has a packet, starting from pointer R[i]. R[i] ← (R[i] + 1) mod N iff the request is granted in Step 2 of the first iteration
• Step 2: Each output grants the first input in round-robin order that has requested it, starting from pointer G[o]. G[o] ← (G[o] + 1) mod N
• Iterate: Repeat Steps 1 and 2 until no more edges can be added or a fixed number of iterations is completed (a software sketch follows the figure below)
• Key to good performance is pointer desynchronization
  – If all VOQs are non-empty, the pointers eventually all point to different outputs
  – No conflicts: maximum performance
[Figure: 4-port DRRM scheduler — VOQ state feeds input selectors IS[1]..IS[4], which exchange requests and grants with output selectors OS[1]..OS[4].]
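For concreteness, here is a minimal Python sketch of one DRRM iteration over an N x N VOQ occupancy matrix. The function and variable names are mine, not the paper's, and a real implementation is parallel priority-encoder hardware rather than sequential software:

    def drrm_iteration(voq, R, G, matched_in, matched_out):
        """One DRRM iteration. voq[i][j] > 0 means input i holds
        cells for output j; R and G are the round-robin pointers."""
        N = len(voq)
        # Step 1: each unmatched input requests one unmatched output,
        # scanning round-robin from its request pointer R[i].
        requests = {}                    # output j -> set of requesting inputs
        for i in range(N):
            if matched_in[i]:
                continue
            for k in range(N):
                j = (R[i] + k) % N
                if voq[i][j] > 0 and not matched_out[j]:
                    requests.setdefault(j, set()).add(i)
                    break
        # Step 2: each requested output grants one input, scanning
        # round-robin from its grant pointer G[j]; pointers advance
        # only when a grant is made.
        for j, reqs in requests.items():
            for k in range(N):
                i = (G[j] + k) % N
                if i in reqs:
                    matched_in[i], matched_out[j] = True, True
                    voq[i][j] -= 1
                    R[i] = (j + 1) % N   # per the slide: update iff granted in iteration 1
                    G[j] = (i + 1) % N
                    break
        return matched_in, matched_out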
Distribution Issues
• Problem: Scheduler does not fit in a single device due to area constraints
  – Quadratic complexity growth of priority encoders
• Monolithic implementation (implicit temporal and spatial assumptions)
  – All results are available before the next time slot (or iteration)
  – All required information is available to all selectors
• Distributed implementation breaks these assumptions
  – Main problem: an input selector issues a request at t0 and receives the result (granted or not) at t0 + RTT
  – Input selectors do not know the results of requests issued during the last RTT
  – Selectors are only aware of local status info (e.g., matches made in previous iterations)
• The time required for information to travel from the inputs to the outputs and back is called the round-trip time (RTT); τ = RTT / (cell duration) (a numeric example follows below)
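To gauge the magnitude of τ in OSMOSIS terms: assuming, purely for illustration, a round trip on the order of the 1 µs latency budget,

    τ = RTT / (cell duration) ≈ 1000 ns / 51.2 ns ≈ 20 time slots

which matches the largest RTT value (20 time slots) used in the results section.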
[Figure: timing diagram — input selectors IS[1]..IS[N] perform input status update and selection, requests travel to output selectors OS[1]..OS[N] for output selection and status update, and grants travel back; the request-grant round trip spans many cell durations (RTT >> cell duration).]
Coping with Uncertainty (1)
• Problem: Uncertainty in the algorithm's status
  – The pointer-update mechanism breaks
  – No desynchronization ⇒ throughput loss
• Solution: Maintain a separate pointer set for each time slot in the RTT (see the sketch below)
  – Basic idea: No pointer is reused before the last result is available
  – Each input (output) selector maintains τ distinct request (grant) pointers, labeled R[t] and G[t], with t ∈ [0, τ-1]
  – At time slot t the input selectors use set R[t mod τ] to generate requests; each request carries the ID of the pointer set used
  – Output selectors generate grants using G[t] in response to requests from R[t]
• Each pointer set is updated independently from the others, so they all desynchronize independently; therefore, all the good features of DRRM are preserved
• Pointer sets are only updated once every RTT, hence they take longer to desynchronize
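A minimal sketch of the pointer-set bookkeeping, with assumed names and example sizes (TAU, N, make_request, and the scan helper are all illustrative, not from the paper):

    TAU = 4    # tau: RTT measured in time slots (example value)
    N = 16     # port count (example value)

    # tau independent pointer sets per side; set s serves slots t with t % TAU == s
    R = [[0] * N for _ in range(TAU)]   # R[s][i]: request pointer of input i in set s
    G = [[0] * N for _ in range(TAU)]   # G[s][o]: grant pointer of output o in set s

    def make_request(t, i, scan_from):
        """At slot t, input i scans outputs round-robin starting at
        R[t % TAU][i] (scan_from is an assumed DRRM-style helper).
        The request carries the set ID so the output side answers
        using the matching grant pointer G[set_id][o]."""
        s = t % TAU
        j = scan_from(i, R[s][i])
        return {"input": i, "output": j, "set_id": s}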
Coping with Uncertainty (2)
• Problem: Uncertainty in the algorithm's status
  – The VOQ-state update mechanism breaks: how many requests were successful?
  – Excess requests may lead to "wasted" grants, reducing performance
• Solution: Maintain a pending request counter for every VOQ (see the sketch below)
  – P(i,j) tracks the number of requests issued for VOQ(i,j) over the last RTT
  – Increment when issuing a new request; decrement when the result arrives
  – Filter requests: if P(i,j) exceeds the number of unserved cells in VOQ(i,j), do not submit further requests
• This massively reduces the number of wasted grants
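A sketch of the per-VOQ pending request counter (PRC) filter, under assumed names (the real design uses hardware counters, one per VOQ):

    class VoqState:
        """Occupancy plus pending request counter for one VOQ(i,j)."""
        def __init__(self):
            self.backlog = 0    # unserved cells queued for this output
            self.pending = 0    # P(i,j): requests in flight over the last RTT

        def may_request(self):
            # Filter: don't request on behalf of cells that
            # already have a request in flight.
            return self.pending < self.backlog

        def on_request_issued(self):
            self.pending += 1

        def on_result(self, granted):
            self.pending -= 1
            if granted:
                self.backlog -= 1   # the granted cell departs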
Multi-pointer Approach (RTT = 4)
[Figure: 4-port distributed scheduler with τ = 4 — each input selector IS[i] holds a request pointer set R[t0;i]..R[t3;i] plus per-VOQ pending request counters; each output selector OS[o] holds a grant pointer set G[t0;o]..G[t3;o].]
• Hardware cost (numbers worked out below):
  – (τ-1) additional pointers at each input/output, each log2 N bits wide
  – N² pending request counters
  – N τ-to-1 multiplexers
  – Selection logic is not duplicated
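Plugging in the OSMOSIS port count (N = 64) with the slide's example τ = 4 gives a feel for the cost:

    (τ - 1) = 3 extra pointers per selector, each log2(64) = 6 bits wide
    N² = 64² = 4096 pending request counters
    N = 64 multiplexers, each τ-to-1, i.e. 4-to-1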
Multiple Iterations
• Additional uncertainty: Which inputs/outputs have been matched in previous iterations?
  1. Inputs should not request outputs that are already taken: wasted requests
  2. Outputs should not grant inputs that are already taken: violation of the one-to-one matching property
• Because of issue 2 above, the output selectors must be aware of all grants in previous iterations, including those made by other selectors
  ⇒ Implement all output selectors in one device (see the sketch below)
• Input selectors use a request flywheel pointer to create request diversity across multiple iterations
• PRC filtering applies only to the first iteration
  – Can lead to "premature" grants
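A sketch of why the output side is centralized: every grant decision must consult shared matched-input state, or the one-to-one property breaks across iterations. A minimal rendering with my own names (not the actual RTL):

    def grant_phase(requests, G, matched_in, matched_out):
        """requests[j] is the set of inputs requesting output j in this
        iteration. All output selectors read and write the shared
        matched_in/matched_out vectors, which is why they live in one
        device: a grant made by OS[j] must be visible to every other
        output selector in the next iteration."""
        N = len(G)
        for j in range(N):
            if matched_out[j] or j not in requests:
                continue
            for k in range(N):
                i = (G[j] + k) % N
                if i in requests[j] and not matched_in[i]:
                    matched_in[i] = matched_out[j] = True
                    G[j] = (i + 1) % N
                    break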
Distributed Scheduler Architecture
[Figure: VOQ state feeds input selectors IS[1]..IS[4], each on a separate control channel interface card; control channels connect them to output selectors OS[1]..OS[4] in the allocators on the midplane, which drive the switch command channels.]
Performance Characteristics (16 ports)
[Figure: four latency-throughput plots, 16 ports, uniform Bernoulli traffic. Panels: RTT = 4, RTT = 10, RTT = 20, and RTT = 4 without PRCs. Each panel plots latency [time slots] (log scale, 1 to 1000) versus throughput (0 to 1) for 1, 2, 3, 4, 8, and 16 iterations and for the monolithic scheduler.]
Optical Switch Controller Module (OSCM)
[Photo: Midplane (OSCB; prototype shown here) with 40 daughter boards (OSCI; top right). Board layout (bottom right).]
Thank You!