
Load-balanced optical packet switch using two-stage time-slot interchangers
IEICE 2004
A. Cassinelli, A. Goulet, M. Ishikawa
University of Tokyo, Department of Information Physics and Computing
M. Naruse, F. Kubota
National Institute of Information and Communications Technology
Plan of the presentation
I. Introduction:
- The Ideal Packet Switch and Our goals
- Some Assumptions
- The BVN Switch and its Scheduling Complexity Bottleneck
II. The Load Balanced BVN Switch
- The LB stage simplifies the scheduling of a BVN switch
- It makes the switch performance independent of traffic non-uniformities
III. Optical implementation of the Load Balancing Switch
The load-balancing architecture allows a simple, deterministic buffer schedule, ideal
for an optical implementation using fiber delay-line based TSIs...
III.1 Single Stage TSI and resulting LBS performance
III.2 Double Stage TSI and resulting LBS performance
IV. Conclusion and Further Research
V. References
I. Introduction
The ideal packet switch should:
• Provide high throughput for any kind of traffic
• Be stable
  – queues in the buffers should remain bounded
• Have low delays
• Manage priority traffic
  – provide throughput guarantees for some ports
  – provide reduced delays for such traffic
Our goal here:
• develop an “ideal” optical packet switch for TDM, possibly for asynchronous
optical networks (WDM remains an additional dimension).
• do that without relying on optical RAM memories, which are not yet mature – only delay lines.
Some preliminary assumptions
• Time is “slotted”, packets have the same size and are “aligned”
• At most one packet arrives per time slot at each input line (no WDM)
• The output lines are not overloaded (traffic is “admissible”)
The BVN switch
Given these assumptions, a good switch candidate is the so-called "Birkhoff-von Neumann
(BVN) switch", first proposed by Chang [1999], based on the works of Birkhoff [1946]
and von Neumann [1953].
Essentially, it is a crossbar switch that:
• has Virtual Output Queues (VOQ) to alleviate HOL blocking,
• relies on an efficient but rather time-consuming O(N^4.5) scheduling algorithm (the BVN
scheduler) to find the appropriate sequence of crossbar states that services the VOQs,
avoids their saturation and reduces packet delay.
... but today there is an additional constraint: given the speed of today's
networks, schedulers are running short of time for computation!
[Chart (from McKeown, Stanford University): clock cycles allowed to schedule a single
packet, plotted for the years 1996 to 2001. At 40 Gb/s, an ATM packet lasts about 11 ns,
i.e. about 10 cycles of a 1 GHz processor...]
So, the “ideal switch” must also rely on a scheduling algorithm with
very low computational complexity.
There is hope...
• It is relatively easy to prove that if the traffic is uniform, then the BVN decomposition consists
of a set of N permutations providing full access. These can be cycled blindly in order to serve
the VOQs.
... this would mean an O(1) scheduling complexity.
• The only condition on this set of N permutations is that they provide full access (i.e., for
any input-output pair, at least one permutation of the set connects that input to that output).
Example: one full-access cycle for N=4 uses the four permutations p0, p1, p2, p3.
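To make this concrete, here is a minimal Python sketch (not from the paper) of such blind O(1) scheduling, assuming for illustration that the full-access set is the N cyclic shifts p_k(i) = (i + k) mod N:

```python
# Minimal sketch (not from the paper): blind O(1) scheduling with a fixed
# full-access set of permutations. The cyclic shifts p_k(i) = (i + k) mod N
# are one convenient choice: over N consecutive slots, every input is
# connected to every output exactly once.
N = 4

def permutation(k, n=N):
    """Crossbar state used in time slot k: input i -> output (i + k) mod n."""
    return [(i + k) % n for i in range(n)]

# Cycle the N states blindly; no per-packet computation is needed.
for t in range(2 * N):
    print(f"slot {t}: crossbar state {permutation(t)}")

# Full-access check: every (input, output) pair is served within one cycle.
pairs = {(i, permutation(k)[i]) for k in range(N) for i in range(N)}
assert len(pairs) == N * N
```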
So...
Is there a way to pre-process an irregular
traffic load such that the inputs of the
switch "see" a uniform load?
Answer:
Yes! It is called "Load Balancing".
There are several ways to do that...
The simplest (deterministic) one consists of adding an
additional input switch stage, which runs through a
periodic sequence of connection patterns that
realize full access...
II. Deterministic Load Balancing
Deterministic Load-Balancing is achieved by running an input switch through a
sequence of periodic connection patterns that realize full access...
[Diagram: a "wild" N-input traffic pattern (e.g. destination sequences 0 0 2 1 1 0 0 0 and
3 3 3 3 3 1 1 1) passes through the load-balancing stage and emerges as "subdued" traffic
spread over the internal lines 0 ... N-1.]
(1) Input load balancing:
[Diagram: the input traffic compared, slot by slot, with the uniformly distributed traffic
obtained after the load-balancing stage.]
(2) Destination (output) balancing:
[Diagram: bursty input traffic (e.g. destination sequences 0 0 2 1 1 0 0 0 and 3 3 3 3 3 1 1 1)
becomes uniformly distributed across destinations after the load-balancing stage.]
(1) Input load is equally distributed at the outputs
(2) Bursty traffic is also distributed (see the sketch below)
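A small Python sketch of this effect, assuming (for illustration only) that the balancing stage uses the same cyclic-shift patterns, so the packet present at input i during slot t is sent to internal line (i + t) mod N:

```python
# Minimal sketch (assumed cyclic-shift patterns, not necessarily the paper's exact
# ones): the load-balancing stage sends the packet at external input i during slot t
# to internal line (i + t) mod N, so even a fully bursty input stream is spread
# evenly over the N buffers of the second stage.
from collections import Counter

N = 4
bursty_destinations = [3] * 8 + [1] * 8   # "wild" traffic on input 0: all to 3, then all to 1

per_internal_line = Counter()
for t, dest in enumerate(bursty_destinations):
    internal_line = (0 + t) % N           # where input 0 is connected in slot t
    per_internal_line[internal_line] += 1

print(dict(per_internal_line))            # {0: 4, 1: 4, 2: 4, 3: 4} -> uniform spreading
```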
The Load-balanced BVN Switch
• The Load-Balancing stage runs through a periodic sequence of connection patterns that
realize full access... just like the crossbar stage, because the traffic it sees is uniform.
[Block diagram: Load-Balancing stage (ports 0 ... N-1) -> Buffer (VOQ) stage, where each
buffer maintains N VOQ FIFO queues -> Crossbar (TDM) stage (ports 0 ... N-1).]
• Moreover, it is possible to prove that this two-stage architecture provides 100%
throughput for a very general class of traffic [Chang & Valiant].
III. Implementation of an
optical Load-balanced switch
Why is the deterministic LBS suited for optical implementation?
(1) Given the particularly simple interconnection requirements (TDM
permutation schedule) of the load-balancing and switching stages, both
stages can be efficiently implemented using a guided-wave-based
Stage-Controlled Banyan Network (SC-BN);
(2) Because of the deterministic, cyclic schedule, it is possible to
emulate the VOQ FIFO queue stage using delay lines instead of
real RAM memory...
main topic of this presentation!
(1) Emulation of the load-balancing and TDM switches by
stage-controlled Banyan network (SC-BN)
• An N x N Banyan network is composed of log2 N stages.
• Each stage is made of N/2 2 x 2 switches.
• In an SC-BN, all switches within a stage are set either in the
bar state or the cross state.
• The N possible permutations of an SC-BN provide full access (a sketch follows the diagram below).
[Diagram: an 8 x 8 Omega network with 3 stages (stage 0, 1, 2) of four 2 x 2 switches,
each switch set to 0 = bar or 1 = cross, all switches of a stage sharing the same setting.
Example: SC-BN with EA gates.]
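As a hedged illustration, the sketch below assumes the standard Omega construction (a perfect shuffle followed by a column of 2 x 2 switches per stage, all switches of a stage sharing one control bit) and checks that cycling through the N stage-control words provides full access; the actual SC-BN wiring used in the paper may differ:

```python
# Minimal sketch (assumption: Omega-type network = perfect shuffle + one column of
# 2 x 2 switches per stage, with one shared bar/cross bit per stage). Cycling
# through the N control words yields N permutations that together give full access.
N = 8
STAGES = N.bit_length() - 1               # log2(N) stages

def sc_banyan_output(inp, control_bits):
    """Route one input through the stage-controlled network."""
    pos = inp
    for c in control_bits:                # one shared bit per stage
        # perfect shuffle = rotate the log2(N)-bit address one position to the left
        pos = ((pos << 1) | (pos >> (STAGES - 1))) & (N - 1)
        if c:                             # cross state exchanges the pair (flips the LSB)
            pos ^= 1
    return pos

# Full-access check: over the N control words, every (input, output) pair occurs.
pairs = set()
for word in range(N):
    bits = [(word >> s) & 1 for s in range(STAGES)]
    for i in range(N):
        pairs.add((i, sc_banyan_output(i, bits)))
assert len(pairs) == N * N
print("full access verified with", N, "stage-control words")
```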
(2) Emulation of VOQ buffers using delay-lines
(a) A packet arrives at time t at port 1, with destination port N-1.
(b) If the LBS were not operating, the packet would be stored in queue N-1 of buffer N-1.
(c) But at time t the LBS permutation was "scrambling" data, so the packet is stored in
queue N-1 of a different buffer.
(d) Finally, this packet has to wait a deterministic amount of time for the correct
permutation to be available at the second TDM stage (plus a multiple of the whole cycle,
if some packet was previously scheduled for the same output).
Concretely:
• A packet arriving at port r at time t with destination d has to be delayed by Δt = δ + kN
time slots, where δ = (d − r − t) modulo N.
• While δ is fixed by the packet, the parameter k can be freely tuned by the scheduling algorithm;
• Such "freedom" will be used to avoid collisions with packets previously scheduled for the same
output, thus effectively simulating a FIFO queue. The way k is chosen depends on the actual TSI
architecture (see the sketch below).
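A minimal sketch of this delay computation (illustrative function and values, not the paper's code):

```python
# Minimal sketch of the delay computation described above. A packet arriving at
# port r in slot t with destination d must wait delta = (d - r - t) mod N slots
# for the right TDM permutation, plus k full cycles (k*N slots) chosen later by
# the contention-resolution step.
def required_delay(r, d, t, k, N):
    delta = (d - r - t) % N        # fixed by the packet
    return delta + k * N           # k is the scheduler's free parameter

# example: N = 4, a packet at port 1 with destination 3
print(required_delay(r=1, d=3, t=6, k=0, N=4))   # -> 0 (permutation already correct)
print(required_delay(r=1, d=3, t=7, k=1, N=4))   # -> 3 + 4 = 7
```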
The nice thing is that, because the total delay can be computed in advance, there is no
need for real memory buffers: a Time-Slot Interchanger (TSI) architecture relying on
delay lines will effectively simulate the VOQs.
[Diagram: the TSI "buffers" feeding the Crossbar (TDM) stage.]
III.1 : Single-stage TSI architecture
• number of delay lines: N·b
• delay increment: 1 time slot
• maximum delay: bN − 1
• total fiber length: bN(bN − 1)/2
• equivalent VOQ FIFO size (maximum delay plus 1, divided by N): b
[Diagram: a 1 x Nb optical switch feeding N·b delay lines of lengths 0, 1, 2, ..., N·b − 1.]
...the performance of this architecture is strictly equivalent to that of a VOQ-based
buffer when using a deterministic schedule!
Contention Resolution
So, a packet arriving at t with destination d at the input of the optical buffer has to be delayed
Δt = δ + kN time slots, where δ = (d − r − t) mod N.
Constraint: the packet may collide with another one when exiting the buffer at point A.
=> k has to be chosen so as to avoid contention at the output of the TSI buffer.
[Diagram: the 1 x Nb switch and its delay lines, with a risk of packet collision at the merging point A.]
How? The maximum delay that a packet can be given is Nb − 1:
=> Need to keep track of the schedule of the Nb − 1 previous time slots by using an electronic
memory of size Nb − 1 (or, more simply, a single counter, but then the strategy does not generalize to multistage buffers).
=> Check for a free schedule, i.e., choose a cycle delay k indicating a free slot. A maximum
of b checks is needed. In our simulations, k is chosen as the smallest index that indicates a free slot, so
as to minimize packet delay, but more complex selections can be made to account for packet priorities (see the sketch below).
Rem: if a packet cannot be scheduled, it will be discarded (so in fact the switch is a 1 x (Nb + 1) switch, whose last line is the
discard line).
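A minimal sketch of this O(b) contention-resolution step, assuming a simple boolean array for the schedule memory (for simplicity the array below has Nb entries, including the current slot):

```python
# Minimal sketch (not the paper's code) of the O(b) contention-resolution step
# for the single-stage TSI. 'busy' tracks which of the next N*b exit slots at
# point A are already taken; k is chosen as the smallest cycle index whose exit
# slot is free, and the packet is discarded if none of the b candidates fits.
N, b = 4, 3
busy = [False] * (N * b)          # busy[j] == True: exit slot t + j already scheduled

def schedule(delta):
    """Return the assigned delay delta + k*N, or None if the packet is dropped."""
    for k in range(b):
        slot = delta + k * N
        if slot <= N * b - 1 and not busy[slot]:
            busy[slot] = True
            return slot
    return None                   # all b candidates occupied -> discard line

def advance_one_slot():
    """Shift the schedule window by one time slot."""
    busy.pop(0)
    busy.append(False)

print(schedule(2))   # -> 2  (k = 0)
print(schedule(2))   # -> 6  (k = 1, since slot 2 is now busy)
advance_one_slot()
```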
Packet Schedule Memory
Example: N = 4, b = 3.
A packet arrives at time t, when permutation P3 is on the TDM switch. However, the packet
destination requires P1. Then we have Δt = 2 + k·N (δ = 2).
[Timing diagram: the packet-schedule memory (total memory cells: Nb − 1 = 11, each marked
occupied, free, or irrelevant for scheduling this packet) drawn against the TDM permutation
schedule, which assigns P3, P0, P1, P2 cyclically to the time slots t' = t + 0, +1, ..., +11;
the candidate exit slots for this packet are spaced N slots apart, starting δ slots after t.]
The resulting scheduling algorithm is O(b)
(and can be made constant using a single counter)
Interesting remark: because contention is resolved by the scheduling algorithm, the following
hardware performs equally well,
...the advantage being a large reduction in the number of fiber delay lines employed: in the
first case we need bN(bN − 1)/2, while in the second implementation only Nb.
This is important when considering scaling of the number of inputs/outputs (N) or the amount
of buffering (b).
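A quick back-of-the-envelope check of the two figures quoted above, using N = 16 and b = 30 purely as illustrative values:

```python
# Quick check of the two quantities quoted above (illustrative values only).
N, b = 16, 30
first_implementation = b * N * (b * N - 1) // 2   # bN(bN - 1)/2, as quoted for the first case
second_implementation = N * b                     # Nb, contention handled by the scheduler
print(first_implementation, second_implementation)  # 114960 vs 480
```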
LBS performance using a single-stage TSI (simulation)
[Plot: packet loss probability (1e-6 to 1e-1, log scale) vs. load (0.35 to 1) for N = 16
inputs/outputs and b = 5, 10, 15, 20, 25, 30, with 10^8 packets simulated per load point.]
Rem: b = 30 corresponds to a FIFO buffer holding a maximum of 30 packets; this is very little
compared with the thousands of packets of some shared-memory buffers on the market...
(Rem: traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage)
LBS performance using a single-stage TSI (simulation)
[Plot: average delay (0 to 200 time slots) vs. load (0.3 to 1) for N = 16 inputs/outputs
and b = 5, 10, 15, 20, 25, with 10^8 packets simulated per load point.]
(Rem: traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage)
Feasibility problems (single stage)
[Power-budget diagram: the Broadcast & Select module (EA module), the fiber delay-line module
and the merging module sit between a booster-amplifier EDFA and a pre-amplifier EDFA.
Assumption: input signal level 0 dBm; booster EDFA saturated output 20 dBm (+20 dB gain,
to stay below 13 dBm); 1:32 broadcast −15 dB; EA and interfacing loss −10 dB; fiber and
interfacing loss −2.5 dB; waveguide and interfacing loss −2.5 dB; pre-amplifier EDFA gain
+20 dB with minimum input −10 dBm. The roughly −30 dB of accumulated loss leaves almost no
valid operating range.]
Why? Because of architectural considerations: for a constant total amount of delay, a
multistage architecture uses far fewer fiber delay lines => smaller switches!!
[Diagram: the buffer split into a 1 x b0 stage and a 1 x b1 stage.]
III.2 : Double-Stage TSI buffer
[Diagram: one stage with b0 FDLs (increment 1 time slot, maximum delay b0 − 1) and one stage
with b1 FDLs (increment b0 time slots, maximum delay (b1 − 1)·b0).]
• number of delay lines: b0 + b1 ... vs. b0·b1 in the case of a single stage
• delay increment (depends on the stage): 1 for the b0-line stage, b0 for the b1-line stage
• maximum delay: b1·b0 − 1
• total fiber length: b0(b0 − 1)/2 + b0·b1(b1 − 1)/2
• equivalent VOQ FIFO size: B = ⌊b0·b1 / N⌋
• By making the minimal increment of the second stage equal to the maximum delay of the
first stage plus one, we ensure a unique decomposition of the required total delay Δt,
which further simplifies scheduling complexity...
...in the following, we will consider that b0 = N (the number of inputs/outputs) and
b1 = b will be variable, corresponding to the equivalent size of a VOQ FIFO buffer:
[Diagram: a cycle-delay stage (k), i.e. a 1 x b switch with FDLs of lengths 0, N, 2N, ...,
(b − 1)·N, and a sub-cycle-delay stage (δ), i.e. a 1 x N switch with FDLs of lengths
0, 1, ..., N − 1.]
• number of delay lines: b + N vs. b·N for the single stage
• delay increment (depends on the stage): N for the cycle-delay stage, 1 for the sub-cycle-delay stage
• maximum delay: b·N − 1
• total fiber length: N(N − 1)/2 + N·b(b − 1)/2 ... vs. bN(bN − 1)/2 for the single stage
• equivalent VOQ FIFO size: b = b1
(See the numeric sketch below.)
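A small numeric sketch of these figures and of the unique delay decomposition, again with N = 16 and b = 30 as illustrative values:

```python
# Minimal sketch comparing the single- and double-stage figures quoted above, and the
# unique split of a required delay Dt into a sub-cycle part (delta, increments of 1)
# and a cycle part (k*N, increments of N). Values N = 16, b = 30 are illustrative only.
N, b = 16, 30

lines_single, lines_double = N * b, N + b
length_single = (b * N) * (b * N - 1) // 2
length_double = N * (N - 1) // 2 + N * b * (b - 1) // 2
print(lines_single, lines_double)       # 480 vs 46 delay lines
print(length_single, length_double)     # 114960 vs 7080 (in time-slot units)

def decompose(total_delay, n=N):
    """Unique decomposition total_delay = delta + k*N with 0 <= delta < N."""
    return total_delay % n, total_delay // n   # (delta, k)

print(decompose(37))                    # -> (5, 2): 5 one-slot steps + 2 cycles of N
```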
Contention
[Diagram: the two-stage buffer, with merging point A at the exit of the first stage (S1)
and merging point B at the exit of the second (and final) stage (S2).]
Now there are 2 locations where contention can happen:
- at the exit A of the first stage (S1)
- at the exit B of the second (and final) stage (S2)
Exit of the stage S1:
The maximum delay that a packet can be given in the stage S1 is (b − 1)N time slots.
=> Need to keep track of the (b − 1)N previous time slots.
=> Need for an electronic memory MEM_S1 of size (b − 1)N that will indicate
which time slots at the exit of S1 are "busy" or "free".
Exit of the stage S2:
The maximum delay that a packet can be given for the whole optical buffer is
(b − 1)N + N − 1 = bN − 1.
=> Need to keep track of the bN − 1 previous time slots.
=> Need for an electronic memory MEM_S2 of size bN − 1 that will indicate which
time slots at the exit of S2 are "busy" or "free".
Rem: if a packet cannot be scheduled, it will be discarded at the first stage (so in fact the first-stage switch is a 1 x (b + 1) switch,
whose last line is the discard line). Discarding a packet at a stage other than the first would be necessary if one used another
scheduling strategy – for instance, a non-unique delay decomposition.
[Diagram: the two-stage buffer (in the example, b1 = b and b0 = N) with its contention
points A and B.]
Remark: again, the contention-avoidance schedule enables the following fiber-length-reducing
architecture to work equally well:
[Diagram: the reduced two-stage delay-line arrangement using the 1 x b and 1 x N switches.]
Temporal diagram of the permutation schedule and of the first and second "crosspoint"
schedules (MEM_S1, MEM_S2).
Example: N = 4, b = 3 (b1 = b and b0 = N).
Rem: some schedule positions do not need to be stored in memory, since they are always
free at the start of a scheduling cycle.
[Timing diagram: MEM_S1 spans (b1 − 1)·b0 = (b − 1)N = 8 slots and MEM_S2 spans
b1·b0 − 1 = bN − 1 = 11 slots, drawn against the permutation schedule, which assigns
P3, P0, P1, P2 cyclically to the time slots t' = t + 0, +1, ..., +11.]
The permutation schedule gives the permutation available at the exit of the TSI buffers at time
t' = t + k (there are N possible permutations). The permutation schedule is not computed as a
function of the traffic (as it would be in a BVN switch): it is deterministic (TDM), therefore
we do not need to store any scheduling memory array for it.
N = 4, b = 3 (b1 = b and b0 = N)
... a packet arrives at time t, such that the requested permutation is P1. We then have δ = 2.
[Diagram: MEM_S1 ((b1 − 1)·b0 = (b − 1)N = 8 cells) and MEM_S2 (b1·b0 − 1 = bN − 1 = 11 cells),
with cells marked occupied, free, or irrelevant; the candidate pairs of cells for k = 0, 1, 2
are spaced N slots apart, drawn against the permutation schedule P3, P0, P1, P2, ... over the
times t' = t + 0, +1, ..., +11. b1 pairs to check => O(b1) schedule!!]
...
The packet will be scheduled to go trough S1 at time
t’=t+2N=t+8, and will exit the network trough S2 at time
t’=t+2N+2 = t+10. Both cells in the considerer pair are made
“busy”, and then the arrays are shifted to the left by one.
In the previous example, b1=3 pairs had to be taken into consideration...
• In general a maximum of b1 memory locations have to be checked.
[Diagram labels: exits E1 and E2; 1 x N and 1 x b switches; b1 = 3 lines in the first stage,
b0 = 4 lines in the second stage.]
So, the overall complexity of the scheduling algorithm is O(b1), as sketched below.
(A strategy using counters is not easy to implement, and may lead to suboptimal schedules.)
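A minimal sketch of this pair-checking step (not the paper's code), using boolean arrays for MEM_S1 and MEM_S2 and assuming, as in the example above, that a packet choosing cycle index k leaves S1 at offset k·N and leaves S2 at offset k·N + δ:

```python
# Minimal sketch (not the paper's code) of the O(b1) two-stage scheduling step.
# For each candidate cycle index k, the packet would leave the first stage S1 at
# t + k*N and the whole buffer (exit of S2) at t + k*N + delta; both exit slots
# must be free, so at most b (slot, slot) pairs are examined.
N, b = 4, 3
mem_s1 = [False] * ((b - 1) * N + 1)     # exit-of-S1 schedule, offsets 0 .. (b-1)*N
mem_s2 = [False] * (b * N)               # exit-of-S2 schedule, offsets 0 .. b*N - 1

def schedule_two_stage(delta):
    """Return (S1 exit offset, S2 exit offset) or None if the packet is dropped."""
    for k in range(b):
        s1_slot = k * N
        s2_slot = k * N + delta
        if not mem_s1[s1_slot] and not mem_s2[s2_slot]:
            mem_s1[s1_slot] = True
            mem_s2[s2_slot] = True
            return s1_slot, s2_slot
    return None                          # discarded on the first stage

print(schedule_two_stage(2))             # -> (0, 2)
print(schedule_two_stage(2))             # -> (4, 6): the k = 0 pair is now busy
```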
One vs. two buffer stages (for the same total fiber length)
[Plot: packet loss probability (1e-6 to 1e-1, log scale) vs. load (0.6 to 1) for N = 16 and
b = 10, 20, 30, comparing the single-buffer and two-buffer architectures; N·b packets = 10^7
simulated per point.]
This indicates that the collision avoidance at the intermediate stage slightly
degrades performance => there is a trade-off between architectural
considerations and performance.
Conclusion
The proposed two-stage load-balanced photonic switch:
• Because it is an LBS, it can achieve high throughput under bursty traffic.
• Because deterministic balancing is used:
  – Guided-wave-integrable stage-controlled Banyan networks can be used both for the switching
stage and the balancing stage.
  – There is no need to employ optical memories for buffering, only fiber delay lines functioning as a TSI.
• It has a scheduling complexity of O(b), where b is the equivalent size of an electronic FIFO
buffer.
• It can (potentially) handle traffic priorities by making k priority-dependent.
• Its performance degrades only slightly compared to a single-stage TSI (*), while:
  – making possible a very large reduction in the number of delay lines,
  – thus using "buffer space" more efficiently.
• It would be possible to modify the architecture so as to handle asynchronous traffic and
packets of different lengths using only TSIs, as in [Harai].
(*) The performance of a single-stage-based photonic switch using Nb − 1 FDLs is strictly equivalent
to that of an LBS using RAM buffers composed of N FIFO queues, each of size b.
...Further Research: generic multi-stage delay-line buffers
There are thousands of ways of implementing a generic multistage buffer (the labels 8-8-8-8,
16-8-8 and 16-64 in the figures denote different stage-size combinations).
One that provides a unique decomposition of the scheduled delay, however, is
such that b_i = l_{i-1}·b_{i-1} = l_0·l_1·l_2…l_{i-1}. For the first stage S0, b_0 corresponds to a delay of
one time slot. Hence, the maximum delay that can be given to a packet by the
whole TSI is equal to B = l_0·l_1·l_2…l_{n-1} (this is also the maximum number of packets
that the TSI can hold). For a switch with N ports, it is comparable to N VOQ
queues of length B_e = B/N. (A decomposition sketch follows.)
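A minimal sketch of this mixed-radix decomposition (stage sizes 16-8-8 are used purely as an example):

```python
# Minimal sketch of the unique delay decomposition for a generic multi-stage TSI
# with stage sizes l0, l1, ..., l_{n-1} and increments b_i = l0*l1*...*l_{i-1}
# (i.e. a mixed-radix number system). Stage sizes 16-8-8 are illustrative only.
def decompose(total_delay, stage_sizes):
    """Split total_delay into one digit d_i per stage, with 0 <= d_i < l_i."""
    digits = []
    for l in stage_sizes:
        digits.append(total_delay % l)   # delay contributed at this stage, in units of b_i
        total_delay //= l
    assert total_delay == 0, "delay exceeds the buffer capacity B = l0*l1*...*l_{n-1}"
    return digits

stage_sizes = [16, 8, 8]                 # B = 1024 slots; with N = 64 ports, Be = 16
print(decompose(500, stage_sizes))       # -> [4, 7, 3]: 4*1 + 7*16 + 3*128 = 500
```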
Packet loss probability, N = 64:
[Plot: packet loss probability (1e-5 to 1, log scale) vs. load (0.65 to 1) for the
configurations 4096, 64-64, 32-32-4, 16-16-16, 8-8-8-8 and 4-4-4-4-4-4.]
Average packet delay, N = 64:
[Plot: average delay (0 to 4000 time slots) vs. load (0.5 to 1) for the
configurations 4096, 64-64, 32-32-4, 16-16-16, 8-8-8-8 and 4-4-4-4-4-4.]
1