Resilient Cell Resequencing in Terabit Routers Jon Turner

Resilient Cell Resequencing
in Terabit Routers
Jon Turner
[email protected]
http://www.arl.wustl.edu/~jst
Why Resequencing?
[Figure: input line cards, a load balancing stage, shared memory switch elements, and output line cards]
Multistage interconnection networks with buffered
switch elements and dynamic routing.
» scalable, bandwidth-efficient architecture
» makes good use of modern CMOS ICs: a single chip can
provide 100 Gb/s throughput and buffer thousands of cells
‹#› - Jonathan Turner
Resequencing with Sequence Numbers
[Figure: sequence numbers added at the input line cards]
Drawbacks
» each output needs N resequencing arrays
» initialization when line card comes on-line
» timeouts needed to cope with lost cells
» multicast requires per flow resequencing arrays
[Figure: resequencing arrays at each output, indexed by sequence number]
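A minimal sketch of sequence-number resequencing at one output, assuming one circular resequencing array per input. The class name, method names, and array size are hypothetical, not taken from the slides. The sketch has no timeout, so a lost cell stalls its input forever, which is one of the drawbacks listed above.

```python
class SeqNumResequencer:
    """One output's resequencer: a resequencing array per input (assumed layout)."""

    def __init__(self, num_inputs, array_size):
        self.size = array_size
        # N arrays per output, each indexed by sequence number mod array size.
        self.arrays = [[None] * array_size for _ in range(num_inputs)]
        self.expected = [0] * num_inputs  # next sequence number due from each input

    def arrive(self, inp, seq, cell):
        """Store the cell; return the cells from this input that are now in order."""
        arr = self.arrays[inp]
        arr[seq % self.size] = cell
        out = []
        # Release the run of consecutive cells starting at the expected number.
        while arr[self.expected[inp] % self.size] is not None:
            slot = self.expected[inp] % self.size
            out.append(arr[slot])
            arr[slot] = None
            self.expected[inp] += 1
        return out
```

For example, if cell 1 from input 0 arrives before cell 0, it is held until cell 0 arrives, and then both are released in order.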
Time-Based Resequencing
[Figure: timestamps added at the inputs; at each output, a timestamp-ordered resequencing buffer feeds the output buffer]
Single resequencing buffer per output.
Cells held until “age” exceeds threshold (T).
Options for late cells.
» discard (strict resequencing) or buffer (loose)
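The behavior described above can be sketched with a priority queue ordered by timestamp. This is only an illustration: the heap stands in for the timestamp-ordered buffer, the names are hypothetical, and a cell is treated as late when it arrives with age greater than T.

```python
import heapq


class TimeBasedResequencer:
    """Single resequencing buffer for one output (illustrative sketch)."""

    def __init__(self, threshold, strict=True):
        self.T = threshold
        self.strict = strict
        self.heap = []       # entries: (timestamp, arrival order, cell)
        self.count = 0       # tie-breaker so cells are never compared
        self.discarded = 0

    def arrive(self, timestamp, cell, now):
        late = now - timestamp > self.T
        if late and self.strict:
            self.discarded += 1          # strict resequencing: drop late cells
            return
        # Loose resequencing buffers late cells along with the rest.
        heapq.heappush(self.heap, (timestamp, self.count, cell))
        self.count += 1

    def release(self, now):
        """Return, in timestamp order, every cell whose age has reached T."""
        out = []
        while self.heap and now - self.heap[0][0] >= self.T:
            out.append(heapq.heappop(self.heap)[2])
        return out
```

A heap costs O(log n) per cell. It is used here only to keep the example short.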
Henrion’s Strict Resequencer Design
[Figure: T slot timing wheel; an arriving cell with timestamp ts is inserted at slot (timestamp mod T); the slot at (current time modulo T) feeds the list of “ready” cells]
Implemented using linked lists in common memory.
Constant time per cell.
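A rough sketch of a T-slot timing wheel in the spirit of the design above. It assumes one clock tick per cell time and integer timestamps. The names are hypothetical, and Python deques stand in for the linked lists in common memory (a real linked-list splice would move a whole slot to the ready list in constant time).

```python
from collections import deque


class StrictTimingWheel:
    """T-slot timing wheel for strict resequencing (illustrative sketch)."""

    def __init__(self, T):
        self.T = T
        self.slots = [deque() for _ in range(T)]
        self.ready = deque()   # cells that have reached age T, in timestamp order
        self.now = 0
        self.discarded = 0

    def insert(self, timestamp, cell):
        age = self.now - timestamp
        if not 0 <= age < self.T:
            self.discarded += 1            # late cell: strict resequencing drops it
            return
        self.slots[timestamp % self.T].append(cell)

    def tick(self):
        """Advance the clock; cells stamped (now - T) become ready."""
        self.now += 1
        slot = self.slots[self.now % self.T]
        self.ready.extend(slot)
        slot.clear()
```

Because every cell in a slot shares one timestamp, draining slots in clock order yields timestamp order without any sorting.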
Implementing Loose Resequencing

Cannot just insert late cells into output list.
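[Figure: enlarged timing wheel; a new cell is inserted at timestamp+T. An arriving cell with timestamp ts1 lands at ts1+T inside the “normal” insertion range of width T, among the “waiting” cells; a late arriving cell with timestamp ts2 lands at ts2+T, among the “ready” cells. Pointers mark the next “ready” cell and the next “waiting” cell.]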
Only approximates loose resequencing.
» must still discard “really late” cells.
» low cost of large timing wheel allows good approximation
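One way to picture the scheme is sketched below. It makes several assumptions that the slides do not confirm: a dictionary keyed by absolute release time (timestamp + T) stands in for the large timing wheel, the hypothetical parameter `max_late` stands in for the wheel's finite range, one cell is forwarded per tick, and the lag pointer is found by a linear scan rather than in constant time.

```python
from collections import deque


class LooseResequencer:
    """Approximate loose resequencing (illustrative sketch, assumptions above)."""

    def __init__(self, T, max_late):
        self.T = T
        self.max_late = max_late   # hypothetical bound on how late a cell may be
        self.slots = {}            # release time -> deque of cells
        self.now = 0
        self.discarded = 0

    def insert(self, timestamp, cell):
        release = timestamp + self.T
        if self.now - release > self.max_late:
            self.discarded += 1    # "really late" cells must still be discarded
            return
        # Late cells land behind the current time but stay in timestamp order.
        self.slots.setdefault(release, deque()).append(cell)

    def tick(self):
        """Advance the clock and forward at most one cell from the oldest slot."""
        self.now += 1
        if not self.slots:
            return None
        lag = min(self.slots)      # lag pointer: oldest non-empty slot
        if lag > self.now:
            return None            # oldest cell is still waiting
        queue = self.slots[lag]
        cell = queue.popleft()
        if not queue:
            del self.slots[lag]
        return cell
```

In this sketch a late cell is forwarded ahead of any younger cells still held, instead of being dropped as in the strict design.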
Fast-Forwarding the Lag Pointer

Lag pointer must advance to the next non-empty list on
every clock tick for constant-time operation.
» no time to check successive pointers in timing wheel
» use fast-forward bits to speed up the process
[Figure: timing wheel occupancy bits stored as words, with a summary word holding one bit per word; the bit for the current time and the next non-empty timing wheel slot are marked]
Two memory reads suffice to find the next slot.
» 32-bit words allow a range of 1024
» 128-bit words allow a range of 16,384
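The fast-forward bits can be sketched as a two-level bitmap: one bit per timing wheel slot, packed into words, plus a summary word with one bit per word. With w-bit words this covers w × w slots, which matches the ranges above (32 × 32 = 1024, 128 × 128 = 16,384). The class and method names are hypothetical, and Python integers stand in for fixed-width memory words.

```python
class FastForwardBits:
    """Two-level bitmap over timing wheel slots (illustrative sketch)."""

    def __init__(self, word_bits=32):
        self.w = word_bits
        self.words = [0] * word_bits   # bit j of words[i] <-> slot i * w + j
        self.summary = 0               # bit i set iff words[i] is non-zero
        self.size = word_bits * word_bits

    def set(self, slot):
        i, j = divmod(slot, self.w)
        self.words[i] |= 1 << j
        self.summary |= 1 << i

    def clear(self, slot):
        i, j = divmod(slot, self.w)
        self.words[i] &= ~(1 << j)
        if self.words[i] == 0:
            self.summary &= ~(1 << i)

    @staticmethod
    def _lowest_set(word, start):
        """Index of the lowest set bit at position >= start, or None."""
        masked = (word >> start) << start
        if masked == 0:
            return None
        return (masked & -masked).bit_length() - 1

    def next_nonempty(self, slot):
        """First non-empty slot at or after `slot`, wrapping around the wheel."""
        i, j = divmod(slot, self.w)
        bit = self._lowest_set(self.words[i], j)
        if bit is not None:
            return i * self.w + bit
        # Use the summary word to skip every empty word in one step.
        k = self._lowest_set(self.summary, i + 1)
        if k is None:
            k = self._lowest_set(self.summary, 0)   # wrap around
        if k is None:
            return None                              # wheel is empty
        return k * self.w + self._lowest_set(self.words[k], 0)
```

The search touches the summary word and at most two occupancy words (only one when the current word holds no later set bit), no matter how many empty slots lie between the lag pointer and the next occupied slot.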
Synchronization

Time-based resequencing requires synchronization
of all line cards.
» in small routers, requires just a common backplane signal
» in large routers, line cards connected to network only by
optical data cables

Requires low-level clock synchronization protocol.
» “master” line card issues periodic broadcast
synchronization messages
» network forwards sync messages with constant delay
» only approximate synchronization is necessary
» new clock master selected on failure

Independent line card clocks require adjustments.
» suspend transmission when delaying clock
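A small sketch of how a line card might track the master clock under these bullets. It assumes the constant network delay is known, clocks count in cell times, and a card that is behind simply jumps forward. The jump-forward rule and all names are assumptions, not taken from the slides.

```python
class LineCardClock:
    """Line card clock slaved to periodic sync messages (illustrative sketch)."""

    def __init__(self, network_delay):
        self.network_delay = network_delay   # assumed constant and known
        self.time = 0
        self.stall = 0                        # ticks left to hold the clock

    def on_sync(self, master_time):
        """Handle a sync message stamped with the master's send time."""
        target = master_time + self.network_delay
        if target >= self.time:
            self.time = target                # behind the master: jump forward
            self.stall = 0
        else:
            self.stall = self.time - target   # ahead of the master: delay the clock

    def tick(self):
        """Advance one cell time; return False while transmission is suspended."""
        if self.stall > 0:
            self.stall -= 1                   # clock held, so no cell is sent
            return False
        self.time += 1
        return True
```

Holding the clock while suspending transmission avoids stamping two cells with the same time while the card waits for the master to catch up.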
Performance of Strict Resequencing


Simple random traffic.
3-stage network, 8-port switch elements (SEs), 512 (shared) cell buffers.
First-stage SEs use round-robin load balancing for each input.
[Figure: late probability (log scale) vs. input load (0.95 to 1.00); simple random traffic, fixed threshold, T = 128; curves for speedup = 1.01, 1.03, 1.05]
Late cells rare with small speedup.
[Figure: late probability (log scale) vs. speedup (1.01 to 1.09); simple random traffic, fixed threshold, input load = 1.0; curves for T = 64, 128, 256]
For systems with 10G links, delay for 256 cells is 10 µs.
Performance on Adversarial Traffic
[Figure: fixed, strict, T = 128; network delay, age of oldest cell, age threshold, and resequencer occupancy (0 to 450) vs. time (0 to 2000), with the overload period marked]
» 2:1 overload at “target” output
» growing network delay during the overload
» cells discarded when delay exceeds T (poor performance)
» delay drops as SE buffers drain
» resequencer recovers when delay drops below T
Performance on Bursty Traffic
[Figure: late probability (log scale) vs. mean dwell time (1 to 10); bursty traffic, strict, fixed, T = 128; curves for speedup = 1.1, 1.3, 1.5]
Poor performance even for large speedups.
100% input load.
Input picks an output at random – overload at 1 in 4 outputs
Stays with target output for geometrically distributed time.
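The source model above can be sketched as a per-input generator of target outputs. It assumes the dwell time is geometric in cell times with the given mean, so the input re-picks its target with probability 1/mean after each cell. The "overload at 1 in 4 outputs" condition is not modeled, and the function name and seed parameter are hypothetical.

```python
import random


def bursty_targets(num_outputs, mean_dwell, num_cells, seed=0):
    """Target output for each successive cell from one input at 100% load."""
    rng = random.Random(seed)
    p_switch = 1.0 / mean_dwell          # geometric dwell time with this mean
    target = rng.randrange(num_outputs)
    targets = []
    for _ in range(num_cells):
        targets.append(target)
        if rng.random() < p_switch:
            # Pick a new output at random (it may repeat the old one).
            target = rng.randrange(num_outputs)
    return targets
```

Because a re-pick can land on the same output, the effective mean burst length is slightly longer than `mean_dwell`.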
Loose Reseq. with Adversarial Traffic
[Figure: fixed, loose, T = 128; network delay, age of oldest cell, age threshold, and resequencer occupancy (0 to 450) vs. time (0 to 2000), with the overload period marked]
Arriving cells are younger than oldest waiting cells, so no resequencing errors.
Loose Reseq. with Bursty Traffic
[Figure: late probability (log scale) vs. mean dwell time (0 to 30); bursty traffic, loose, fixed, T = 128; curves for speedup = 1.1, 1.3, 1.5]
Tolerates about 3x longer bursts than strict reseq.
Adaptive Resequencing

Adjust age threshold to match observed delay.
» parameters: window size (W), short term delay
difference bound (D)
» variables: max delay in current measurement window (d0)
and previous measurement window (d-1)
» age threshold = D + max{d0, d-1}
» implement by extending loose, fixed threshold design
Theorem (simplified). If cell c1 enters the
interconnection network just before c2 and exits
no later than D after c2, then an adaptive resequencer
with W ≥ D forwards c1 before c2.
Resequencing errors are caused by excessive delay
variability, rather than large delays.
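The threshold rule above (age threshold = D + max{d0, d-1}) can be sketched as follows. The sketch assumes fixed, back-to-back measurement windows of length W starting at time 0 and delays measured as arrival time minus timestamp. The class and method names are hypothetical.

```python
class AdaptiveThreshold:
    """Age threshold that tracks observed network delay (illustrative sketch)."""

    def __init__(self, W, D):
        self.W = W                 # measurement window size
        self.D = D                 # short-term delay difference bound
        self.d0 = 0                # max delay in the current window
        self.d_prev = 0            # max delay in the previous window (d-1)
        self.window_start = 0

    def _roll(self, now):
        # Close every window that has ended by `now`.
        while now - self.window_start >= self.W:
            self.d_prev = self.d0
            self.d0 = 0
            self.window_start += self.W

    def observe(self, now, delay):
        """Record the network delay of a cell arriving at time `now`."""
        self._roll(now)
        self.d0 = max(self.d0, delay)

    def threshold(self, now):
        self._roll(now)
        return self.D + max(self.d0, self.d_prev)
```

A large delay sample keeps the threshold high until it has aged out of both windows, so upward adjustments are immediate and downward adjustments lag by up to two windows.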

Performance on Adversarial Traffic
[Figure: adaptive, W = D = 32; network delay, age of oldest cell, age threshold, and resequencer occupancy (0 to 450) vs. time (0 to 2000), with the overload period marked]
» age threshold tracks network delay
» arriving cells are younger than oldest waiting cells, so no resequencing errors
» resequencer adds small increment to network delay
» resequencer occupancy stays bounded
Performance on Bursty Traffic
[Figure: late probability (log scale) vs. mean dwell time (0 to 60); bursty traffic, adaptive, D = 64; late + input loss; curves for speedup = 1.1, 1.3, 1.5]
» performance degrades when switch buffers fill; caused by delay variation in first stage
» less sensitive to extremely large bursts
» tolerates about 2x longer bursts than loose reseq.
Boosting Performance for Long Bursts
[Figure: late probability (log scale, 1.E-07 to 1.E+00) vs. D (20 to 200); bursty traffic, dwell = 100, speedup = 1.2, adaptive resequencer; curves for first stage buffer capacity = 512, 256, 128, 64, 32]
» limiting first stage buffering cuts variability
» small first stage buffer gives good performance with modest “extra” delay
» smallest buffer reduces switch throughput by 2%
Speeding up Threshold Reductions
Downward age threshold adjustments are delayed
by window mechanism.
Can reduce delay using “finer-grained” windows.
» max delay d0, d-1, d-2, . . . in k measurement windows
» age threshold = D + max{d0, d-1, d-2, . . .}
Theorem (simplified). If cell c1 enters the
interconnection network just before c2 and exits
no later than D after c2, then an adaptive resequencer
with (k-1)W ≥ D forwards c1 before c2.
For large k, max delay in threshold adjustment is
cut almost in half.
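A sketch of the k-window variant, plus a small calculation of how long an old delay sample can hold the threshold up. It assumes back-to-back windows of length W starting at time 0. It also reads "max delay in threshold adjustment" as k·W with the smallest W the theorem allows. That reading, and all names, are assumptions.

```python
from collections import deque
from math import ceil


class WindowedThreshold:
    """Age threshold over k fine-grained windows (illustrative sketch)."""

    def __init__(self, W, D, k):
        self.W, self.D = W, D
        self.maxima = deque([0] * k, maxlen=k)   # maxima[-1] is d0
        self.window_start = 0

    def _roll(self, now):
        while now - self.window_start >= self.W:
            self.maxima.append(0)                # oldest window drops out
            self.window_start += self.W

    def observe(self, now, delay):
        self._roll(now)
        self.maxima[-1] = max(self.maxima[-1], delay)

    def threshold(self, now):
        self._roll(now)
        return self.D + max(self.maxima)


def worst_case_decay(D, k):
    """Longest time a sample can influence the threshold when (k-1)W >= D."""
    W = ceil(D / (k - 1))                        # smallest window the theorem allows
    return k * W
```

With D = 64, two windows give a worst case of 128 cell times, while nine windows of 8 give 72, which is consistent with the "almost in half" claim for large k.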

Closing Remarks

Adaptive resequencing can virtually eliminate
resequencing errors in multistage networks.
» to handle most extreme traffic, need to limit delay
variability in load balancing stages
» in systems that regulate the overall flow of traffic,
extreme cases should not arise at all
Henrion’s strict resequencer can be modified for
adaptive resequencing – O(1) time per cell.
More sophisticated interconnection network can
increase throughput and reduce delay variability.
» per destination queues with controlled buffer sharing
» per destination flow control and load balancing
» timestamp-ordered switch element queues