The Fork-Join Router
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997
Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science, Stanford University
[email protected]
http://www.stanford.edu/~nickm
Outline
• Quick Background on Packet Switches
• What’s the problem?
“What if data rates exceed memory bandwidth?”
• The Fork-Join Router
• Parallel Packet Switches
First Generation Packet Switches
[Figure: a shared backplane connects a CPU, a central buffer memory, and line interfaces (each a MAC plus DMA engine). Packets cross the shared bus as fixed-length "DMA" blocks or cells and are reassembled on the egress linecard; the external lines carry fixed-length cells or variable-length packets.]
Second Generation Packet Switches
[Figure: still a shared backplane with a CPU and central buffer memory, but each line card now has its own local buffer memory behind its MAC and DMA engine.]
Third Generation Packet Switches
[Figure: a switched backplane interconnects the line cards (each with local buffer memory and MAC) and the CPU card, replacing the shared bus.]
Fourth Generation Packet Switches
Two Basic Techniques
• Shared Memory: N+N = 2N memory operations per cell time.
• Input-queued Crossbar: 1+1 = 2 memory operations per cell time.
The Ideal
[Figure: the ideal switch: cells arriving from every input flow directly into per-output queues, with no contention on the way in.]
A large body of work has proven and made possible:
– Fairness
– Delay Guarantees
– Delay Variation Control
– Loss Guarantees
– Statistical Guarantees
Precise Emulation of an Output
Queued Switch
[Figure: an N-port output-queued switch compared ("=?") with an N-port combined input-output queued switch driven by a scheduler.]
Result
Theorem: A speedup of 2-1/N is necessary and sufficient for a combined input- and output-queued switch to precisely emulate an output-queued switch for all traffic.
Joint work with Balaji Prabhakar at Stanford.
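A quick numeric look at this bound (a sketch added for illustration; the function name is invented, not from the talk):

```python
# Tabulate the necessary-and-sufficient speedup 2 - 1/N from the theorem:
# it is 1.5 for a 2-port switch and approaches (but never reaches) 2 as
# the port count N grows, so a speedup of 2 always suffices.

def required_speedup(n_ports):
    return 2 - 1 / n_ports

for n in (2, 4, 16, 256):
    print(n, required_speedup(n))
# 2 1.5
# 4 1.75
# 16 1.9375
# 256 1.99609375
```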
Outline
• Quick Background on Packet Switches
• What’s the problem?
“What if data rates exceed memory bandwidth?”
• The Fork-Join Router
• Parallel Packet Switches
Buffer Memory
How Fast Can I Make a Packet Buffer?
[Figure: a buffer memory built from 5ns SRAM, with a 64-byte wide bus in and a 64-byte wide bus out.]
Rough Estimate:
– 5ns per memory operation.
– Two memory operations per packet.
– Therefore, maximum 51.2Gb/s.
– In practice, closer to 40Gb/s.
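The estimate above is easy to reproduce (a back-of-envelope sketch; the constant names are illustrative, not from the slides):

```python
# Peak buffer bandwidth for a 64-byte-wide bus into 5ns SRAM, with one
# write (arrival) and one read (departure) per packet.

BUS_WIDTH_BYTES = 64      # bytes moved per memory operation
CYCLE_NS = 5              # SRAM access time per operation
OPS_PER_PACKET = 2        # write on arrival + read on departure

bits_per_op = BUS_WIDTH_BYTES * 8                 # 512 bits
time_per_packet_ns = OPS_PER_PACKET * CYCLE_NS    # 10 ns
max_rate_gbps = bits_per_op / time_per_packet_ns  # bits/ns == Gb/s

print(max_rate_gbps)  # 51.2
```

Control and bus-turnaround overheads eat into this peak, which is why the slide puts practical throughput closer to 40Gb/s.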
Buffer Memory
Is It Going to Get Better?
[Figure: trend curves over time: specmarks, memory size, and gate density all climb steeply, while memory bandwidth (to core) grows far more slowly.]
Optical Physical Layers…
…are Going to Make Things “Worse”
• DWDM:
– More λ’s per fiber ⇒ more “ports” per switch.
– # ports: 16, …, 1000’s.
• Data rate:
– More b/s per λ ⇒ higher capacity.
– Data rates: 2.5Gb/s, 10Gb/s, 40Gb/s, 160Gb/s, …
Approach #1: Ping-pong Buffering
[Figure: two buffer memories side by side, each on its own 64-byte wide bus; successive memory operations alternate (“ping-pong”) between the two.]
Memory bandwidth doubled to ~80 Gb/s.
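The ping-pong idea can be sketched in a few lines (a toy model with a hypothetical class; real hardware alternates fixed-size SRAM accesses, not Python list appends):

```python
# Toy model of ping-pong buffering: consecutive writes alternate between
# two memory banks, so each bank sees only every other access and the
# aggregate write bandwidth roughly doubles.

class PingPongBuffer:
    def __init__(self):
        self.banks = ([], [])   # two independent physical memories
        self.next_bank = 0      # bank that takes the next write

    def write(self, word):
        self.banks[self.next_bank].append(word)
        self.next_bank ^= 1     # ping-pong to the other bank

buf = PingPongBuffer()
for word in range(8):
    buf.write(word)
print(buf.banks)  # ([0, 2, 4, 6], [1, 3, 5, 7])
```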
Approach #2: Multiple Parallel Buffers
(aka Banking, Interleaving)
[Figure: four buffer memories in parallel; memory operations are spread across the banks.]
Outline
• Quick Background on Packet Switches
• What’s the problem?
“What if data rates exceed memory bandwidth?”
• The Fork-Join Router
• Parallel Packet Switches
The Fork-Join Router
[Figure: N external ports at rate R; arriving traffic is forked, bufferlessly, across k internal router slices (1…k) and joined again before the rate-R output lines.]
The Fork-Join Router
• Advantages
– k ↑ ⇒ memory bandwidth per slice ↓
– k ↑ ⇒ lookup/classification rate per slice ↓
– k ↑ ⇒ routing/classification table size per slice ↓
• Problems
– How to demultiplex prior to lookup/classification?
– How does the system perform/behave?
– Can we predict/guarantee performance?
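One illustrative reading of the fork/join structure (a sketch, not the talk's actual demultiplexing scheme, and the names are invented): spray arriving cells round-robin over the k slices, tag each with a sequence number, and re-sequence at the join so the output sees the original order.

```python
# Fork: distribute tagged cells over k slower slices round-robin.
k = 4
slices = [[] for _ in range(k)]

cells = [f"cell{n}" for n in range(10)]
for seq, cell in enumerate(cells):
    slices[seq % k].append((seq, cell))

# Join: merge the per-slice streams and restore order by sequence number.
merged = sorted(item for s in slices for item in s)
restored = [cell for _, cell in merged]
print(restored == cells)  # True
```

Round-robin spraying keeps the slices evenly loaded, but it forces every slice to handle cells from every flow, which is exactly why the demultiplexing question on this slide is nontrivial.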
Outline
• Quick Background on Packet Switches
• What’s the problem?
“What if data rates exceed memory bandwidth?”
• The Fork-Join Router
• Parallel Packet Switches
A Parallel Packet Switch
[Figure: external ports 1…N at rate R; each input demultiplexes its cells over k parallel output-queued switches, and each output multiplexes the k internal streams back onto a single rate-R line.]
Parallel Packet Switch
Questions
1. Can it be work-conserving?
2. Can it emulate a single big output
queued switch?
3. Can it support delay guarantees,
strict-priorities, WFQ, …?
4. What happens with multicast?
Parallel Packet Switch
Work Conservation
[Figure: the external ports run at rate R, but each input can send to a given internal switch at only R/k (the input link constraint) and each output can drain a given internal switch at only R/k (the output link constraint).]
[Figure: an arrival sequence of numbered cells illustrating how the R/k output link constraint can force the rate-R output line to go idle while cells still wait inside the internal switches, breaking work conservation.]
[Figure: the same parallel packet switch with every internal link sped up from R/k to S(R/k), for a speedup factor S, on both the input and output sides of the k output-queued switches.]
Precise Emulation of an Output Queued Switch
[Figure: an N-port output-queued switch compared (“=?”) with an N-port parallel packet switch.]
Parallel Packet Switch
Theorems
1. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can be work-conserving for all traffic.
2. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic.
3. If S > 3k/(k+3) ≈ 3, then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
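The two bounds are easy to tabulate (a sketch; the function names are invented here). Note the identity 2k/(k+2) = 2 - 4/(k+2), the "expansion" form quoted in the Clos aside below.

```python
# S1(k) = 2k/(k+2) suffices for work conservation and FCFS emulation;
# S2(k) = 3k/(k+3) suffices for WFQ/strict-priority emulation.
# Both stay strictly below their limits (2 and 3) for every finite k.

def s_fcfs(k):
    return 2 * k / (k + 2)

def s_qos(k):
    return 3 * k / (k + 3)

for k in (2, 4, 8, 32, 1024):
    print(k, round(s_fcfs(k), 3), round(s_qos(k), 3))
# the first column approaches 2 and the second approaches 3 as k grows
```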
An aside
Unbuffered Clos Circuit Switch
Expansion factor required = 2-1/N
Clos Network
[Figure: a three-stage Clos network with input switches I1…Ix, R middle-stage switches, and output switches O1…Ox; each input switch has m links toward the middle stage, and each output switch has m links from the middle stage. The middle-stage connection matrix has <= min(R,m) entries in each row and <= min(R,m) entries in each column.]
Clos Network
[Figure: the same three-stage Clos network, annotated for the counting argument below.]
Define: UIL(Ii) = used links at switch Ii to connect to middle stages;
UOL(Oi) = used links at switch Oi to connect to middle stages.
If we wish to connect Ii to Oi:
When adding the connection: |UIL(Ii)| <= m-1 and |UOL(Oi)| <= m-1.
Worst case: |UIL(Ii) ∪ UOL(Oi)| = 2m-2, i.e. 2m-2 middle-stage switches are unusable.
Therefore, if R >= 2m-1 there is always a free middle-stage switch.
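The counting argument can be checked mechanically (a small sketch with invented names; since at most 2m-2 middle switches can be blocked, R = 2m-1 always leaves one free, matching the expansion factor (2m-1)/m = 2 - 1/m):

```python
# Adding a new (Ii, Oi) connection finds at most m-1 middle switches
# busy at Ii and at most m-1 busy at Oi, so at most 2m-2 are unusable.

def free_middle_exists(R, used_at_Ii, used_at_Oi):
    blocked = set(used_at_Ii) | set(used_at_Oi)   # |UIL ∪ UOL|
    return len(blocked) < R

m = 4
# Worst case: the two used sets are disjoint, blocking 2m-2 switches.
worst_Ii = set(range(m - 1))             # middle switches 0 .. m-2
worst_Oi = set(range(m - 1, 2 * m - 2))  # middle switches m-1 .. 2m-4

print(free_middle_exists(2 * m - 1, worst_Ii, worst_Oi))  # True
print(free_middle_exists(2 * m - 2, worst_Ii, worst_Oi))  # False
```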
An aside
Unbuffered Clos Circuit Switch
Expansion ⇒ 2 - 4/(k+2) (the speedup bound 2k/(k+2), rewritten)
Expansion factor required = 2 - 1/N
Fork-Join Router Project
What’s next?
• Theory:
– Extending results to distributed algorithms.
– Extending results to multicast.
• Implementation/Prototyping:
– Under discussion...