Switching and Router Design
2007. 10
A generic switch
Classification
Packet vs. circuit switches
packets have headers; samples don't
Connectionless vs. connection-oriented
connection-oriented switches need a call setup
setup is handled in the control plane by a switch controller
connectionless switches deal with self-contained datagrams
Packet switch, connectionless: router (e.g., an Internet router)
Packet switch, connection-oriented: switching system (e.g., an ATM switching system)
Circuit switch, connection-oriented: switching system (e.g., a telephone switching system)
Requirements
The capacity of a switch is the maximum rate at which it can move information, assuming all data paths are simultaneously active
Primary goal: maximize capacity, subject to cost and reliability constraints
A circuit switch must reject a call if it can't find a path for samples from input to output
goal: minimize call blocking
A packet switch must reject a packet if it can't find a buffer to store it while awaiting access to the output trunk
goal: minimize packet loss
Don’t reorder packets
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement (Architecture)
Schedule the fabric / crossbar / backplane
Routing lookup
Packet switching
In a packet switch, for every packet, you must:
Do a routing lookup: decide where (which port) to send it
Datagram: lookup based on the entire destination address (packets carry a destination field)
Cell: lookup based on the VCI
Schedule the fabric / crossbar / backplane
Maybe buffer, maybe QoS, maybe filtering by ACLs
Back-of-the-envelope numbers
Line cards can be 40 Gbit/sec today (OC-768)
To handle minimum-sized packets (~40 B): 125 Mpps, or 8 ns per packet
But note that this can be deeply pipelined, at the cost of buffering and complexity. Some lookup chips do this, though still with SRAM, not DRAM. Good lookup algorithms are still needed.
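As a quick sanity check, the arithmetic behind these numbers in a few lines of Python (the 40 Gbit/s line rate and ~40 B minimum packet are the slide's figures):

```python
# Back-of-the-envelope: per-packet time budget on a 40 Gbit/s line card.
LINE_RATE_BPS = 40e9   # OC-768 line rate, bits per second
MIN_PKT_BYTES = 40     # minimum-sized packet, ~40 B

pkts_per_sec = LINE_RATE_BPS / (MIN_PKT_BYTES * 8)
ns_per_pkt = 1e9 / pkts_per_sec

print(f"{pkts_per_sec / 1e6:.0f} Mpps")    # -> 125 Mpps
print(f"{ns_per_pkt:.0f} ns per packet")   # -> 8 ns per packet
```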
Router Architecture
Control Plane
How routing protocols establish routes, etc.
Port mappers
Data Plane
How packets get forwarded
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
First generation switch - shared memory
Line card DMAs the packet into a buffer; the CPU examines the header and has the output line card DMA it out
Bottleneck can be the CPU, the host adapter, or the I/O bus, depending on the design
Most Ethernet switches and cheap packet routers
Low-cost routers; speed: 300 Mbps – 1 Gbps
Example (First generation switch)
A router built today with a 1.33 GHz CPU; the CPU is the bottleneck
Mean packet size 500 B
Bus ~ 100 MHz, memory ~ 5 ns, so a word (4 B) access takes ~15 ns
1) Interrupt takes 2.5 µs per packet
2) Per-packet processing time (200 instructions) = 0.15 µs
3) Copying the packet: 4 instructions + 2 memory accesses = 33 ns per 4 B word, so 500/4 × 33 ns ≈ 4.1 µs
Total time = 2.5 + 0.15 + 4.1 = 6.75 µs
=> speed is 500 × 8 bits / 6.75 µs ≈ 600 Mbps
Amortized interrupt cost is balanced by the routing-protocol cost
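The same budget can be reproduced as a small Python sketch; all constants below are the slide's assumptions (the 15 ns word access being one 10 ns bus cycle plus 5 ns of memory latency):

```python
# First-generation (shared-memory) switch: CPU-bound forwarding budget.
CPU_HZ = 1.33e9          # CPU clock
PKT_BYTES = 500          # mean packet size
INTERRUPT_US = 2.5       # per-packet interrupt cost, microseconds
INSTRUCTIONS = 200       # per-packet header processing
MEM_ACCESS_NS = 15       # one 4 B word over the bus (10 ns bus + 5 ns memory)

proc_us = INSTRUCTIONS / CPU_HZ * 1e6              # ~0.15 us
word_ns = 4 / CPU_HZ * 1e9 + 2 * MEM_ACCESS_NS     # 4 instructions + 2 accesses ~ 33 ns
copy_us = (PKT_BYTES / 4) * word_ns / 1e3          # ~4.1 us

total_us = INTERRUPT_US + proc_us + copy_us        # ~6.75 us
print(f"{total_us:.2f} us per packet")
print(f"~{PKT_BYTES * 8 / total_us:.0f} Mbps")     # ~590, i.e. the slide's ~600 Mbps
```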
Second generation switch - shared bus
Port mapping (forwarding decisions) is done in the line cards
direct transfer over the bus between line cards
if no route in the line card --> CPU ("slow" operations)
Bottleneck is the bus
Medium-end routers, switches, ATM switches
Speed: 10 Gbps (> 8× a first-generation switch)
Third generation switches
- point-to-point (switched) bus (fabric)
(a tag prepended to each packet self-routes it through the fabric)
Third generation (contd.)
The bottleneck in a second-generation switch is the bus
A third-generation switch provides parallel paths (fabric)
Features
self-routing fabric (+ tag)
the output buffer is a point of contention, unless we arbitrate access to the fabric
potential for unlimited scaling, as long as we can resolve contention for the output buffer
High-end routers, switches
Speed: 1000 Gbps
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Buffered crossbar
What happens if packets at two inputs both want to go to the same output?
Can defer one at an input buffer
Or, buffer the crosspoints
Broadcast
Packets are tagged with output port #
Each output matches tags
Need to match N addresses in parallel at output
Useful only for small switches, or as a stage in a large switch
Switch fabric element
Can build complicated fabrics from a simple element
Self-routing rule: if the tag bit = 0, send the packet to the upper output, else to the lower output
If both packets go to the same output, buffer or drop one
Fabrics built with switching elements
An N×N switch built from b×b elements has log_b N stages with N/b elements per stage
ex: an 8×8 switch built from 2×2 elements has 3 stages with 4 elements per stage
The fabric is self-routing (see the sketch below)
Recursive
Can be synchronous or asynchronous
Regular and suitable for VLSI implementation
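To make the self-routing rule concrete, here is a minimal sketch (illustrative, not any particular fabric's implementation) of how a destination tag steers a cell through an 8×8 banyan of 2×2 elements, one tag bit per stage:

```python
# Self-routing through a banyan fabric: each 2x2 element examines one bit
# of the destination tag (MSB first) and sends the cell to its upper
# output on 0, lower output on 1. After log2(N) stages the cell has
# spelled out its destination port number.
import math

def route(dest_port: int, n_ports: int) -> list:
    stages = int(math.log2(n_ports))        # log_b N stages for b = 2
    decisions = []
    for k in range(stages - 1, -1, -1):     # consume tag bits MSB -> LSB
        bit = (dest_port >> k) & 1
        decisions.append("lower" if bit else "upper")
    return decisions

# A cell tagged for output 5 (binary 101) in an 8x8 fabric:
print(route(5, 8))   # ['lower', 'upper', 'lower'] -- 3 stages, as expected
```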
The fabric of a Batcher-banyan switch (figure)
An example using a Batcher-banyan switch (figure)
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement (Architecture)
Schedule the fabric / crossbar / backplane
Routing lookup
Generic Router Architecture
Speedup
C: input/output link capacity
RI: maximum rate at which an input interface can send data into the backplane
RO: maximum rate at which an output interface can read data from the backplane
B: maximum aggregate backplane transfer rate
Backplane speedup: B/C
Input speedup: RI/C
Output speedup: RO/C
(Figure: input interfaces at rate RI feed the interconnection medium (backplane) of capacity B; output interfaces read at rate RO; external links run at capacity C.)
Buffering - three router architectures
Where should we place buffers?
Output queued (OQ)
Input queued (IQ)
Combined Input-Output queued (CIOQ)
Function division
Input interfaces:
Must perform packet forwarding – need to know to which output interface to send packets
May enqueue packets and perform scheduling
Output interfaces:
May enqueue packets and perform scheduling
Output Queued (OQ) Routers
Only output interfaces store packets
Advantages
Easy to design algorithms: only one congestion point
Disadvantages
Requires an output speedup of N, where N is the number of interfaces
not feasible
(Figure: the backplane delivers packets directly to queues at the output interfaces, which drain at rate RO onto links of capacity C.)
Input Queueing (IQ) Routers
Only input interfaces store packets
Advantages
Easy to build
Store packets at inputs if there is contention at outputs
Relatively easy to design algorithms
Only one congestion point, but it is not at the output…
need to implement backpressure
Disadvantages
In general, hard to achieve high utilization
However, theoretical and simulation results show that for realistic traffic an input/output speedup of 2 is enough to achieve utilizations close to 1
Combined Input-Output Queueing (CIOQ) Routers
Both input and output interfaces store packets
Advantages
Easy to build
Utilization 1 can be achieved with an input/output speedup (<= 2)
Disadvantages
Harder to design algorithms
Two congestion points
Need to design flow control
With an input/output speedup of 2, a CIOQ switch can emulate any work-conserving OQ switch [G+98, SZ98]
Generic Architecture of a High-Speed Router Today
CIOQ - Combined Input-Output Queued Architecture
Input/output speedup <= 2
Input interface
Performs packet forwarding (and classification)
Output interface
Performs packet (classification and) scheduling
Backplane / fabric
Point-to-point (switched) bus; speedup N
Schedules packet transfers from inputs to outputs
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Backplane / Fabric / Crossbar
A point-to-point switch allows simultaneous packet transfers between any disjoint pairs of input and output interfaces
Goal: come up with a schedule that
Maximizes router throughput
Meets flow QoS requirements
Challenges:
Address head-of-line blocking at inputs
Resolve input/output contention, given the speedups
Avoid packet dropping at outputs if possible
Note: packets are fragmented into fixed-size cells (why?) at the inputs and reassembled at the outputs
In Partridge et al., a cell is 64 B (what are the trade-offs?)
Head-of-line Blocking
The cell at the head of an input queue cannot be transferred, thus blocking the cells behind it
(Figure: inputs 1–3 and outputs 1–3; the red head-of-queue cell cannot be transferred because its output buffer is full, and the cell behind it, destined to a free output, is blocked by the red cell.)
To Avoid Head-of-line Blocking
With only 1 queue per input, head-of-line blocking limits max throughput to <= 2 - sqrt(2) ≈ 58.6%
Solution? Maintain N virtual queues at each input, i.e., one per output
Requires N queues per input; more if QoS
The Way It’s Done Now
(Figure: each input maintains one virtual output queue per output; inputs 1–3 each hold per-output queues feeding outputs 1–3 through the crossbar.)
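The virtual-output-queue structure itself is tiny to express; a sketch (the 3-port size and names are illustrative):

```python
# Virtual output queues (VOQs): each input keeps one FIFO per output, so a
# cell waiting for a busy output never blocks cells headed elsewhere.
from collections import deque

N = 3  # ports (illustrative)
voq = [[deque() for _ in range(N)] for _ in range(N)]  # voq[input][output]

def enqueue(inp: int, out: int, cell) -> None:
    voq[inp][out].append(cell)

def head_cells(inp: int) -> dict:
    """Cells eligible for scheduling at this input: the head of each
    non-empty VOQ, one per output."""
    return {out: q[0] for out, q in enumerate(voq[inp]) if q}

enqueue(0, 2, "a.1"); enqueue(0, 1, "b.1")
print(head_cells(0))   # both heads visible -> no head-of-line blocking
```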
Cell transfer
Schedule: ideally, find the maximum number of input-output pairs such that we:
Resolve input/output contention
Avoid packet drops at outputs
Packets meet their time constraints (e.g., deadlines), if any
Example:
Use stable matching
Try to emulate an OQ switch
Stable Marriage Problem
Consider N women and N men
Each woman/man ranks each man/woman in the order of their preferences
A stable matching is a matching with no blocking pairs
Blocking pair: let p(i) denote the partner of i; there are matched pairs (k, p(k)) and (j, p(j)) such that k prefers p(j) to p(k), and p(j) prefers k to j
Gale-Shapley Algorithm (GSA)
As long as there is a free man m:
m proposes to the highest-ranked woman w on his list to whom he hasn't yet proposed
If w is free, m and w become engaged
If w is engaged to m' and w prefers m to m', w releases m'
Otherwise m remains free
A stable matching exists for every set of preference lists
Complexity: worst-case O(N²)
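A compact Python rendering of GSA as just described (names are illustrative):

```python
# Gale-Shapley: men propose in preference order; each woman holds the best
# offer so far. Terminates after at most N^2 proposals with a stable matching.
def gale_shapley(men_prefs, women_prefs):
    """men_prefs[m] / women_prefs[w]: preference lists of indices 0..N-1."""
    n = len(men_prefs)
    rank = [{m: r for r, m in enumerate(women_prefs[w])} for w in range(n)]
    next_prop = [0] * n      # next woman on each man's list
    fiance = [None] * n      # fiance[w] = man currently engaged to w
    free_men = list(range(n))
    while free_men:
        m = free_men.pop()
        w = men_prefs[m][next_prop[m]]   # highest-ranked not yet proposed to
        next_prop[m] += 1
        if fiance[w] is None:
            fiance[w] = m                       # w was free: engage
        elif rank[w][m] < rank[w][fiance[w]]:
            free_men.append(fiance[w])          # w prefers m: release m'
            fiance[w] = m
        else:
            free_men.append(m)                  # rejected: m remains free
    return fiance

print(gale_shapley([[0, 1], [0, 1]], [[1, 0], [0, 1]]))  # -> [1, 0]
```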
OQ Emulation with a Speedup of 2
Each input and output maintains a preference list
Input preference list: the cells at that input, ordered in the inverse order of their arrival
Output preference list: all input cells to be forwarded to that output, ordered by the times they would be served in an Output Queueing schedule
Use GSA to match inputs to outputs
Outputs initiate the matching
Can emulate all work-conserving schedulers
Example with a Speedup of 2
(Figure, panels (a)–(d): inputs 1–3 hold cells a.1, a.2, b.1–b.3, c.1–c.3 destined to outputs a–c; the panels show how the queues evolve under the stable matching, with panel (d) labeled "after step 1".)
A Case Study
[Partridge et al. '98]
Goal: show that routers can keep pace with improvements in transmission link bandwidth
Architecture
A CIOQ router
15 (input/output) line cards: C = 2.4 Gbps (3.3 Gbps including packet headers)
Each line card can handle up to 16 (input/output) interfaces
Separate forwarding engines (FEs) perform the routing lookups
Backplane: point-to-point (switched) bus, capacity B = 50 Gbps (32 MPPS)
B/C = 50/2.4 ≈ 20
Router Architecture
(Figure: 15 line cards (data in / data out) and the forwarding engines attach to the backplane; line cards hand each packet header to a forwarding engine, while a network processor handles control data (e.g., routing), updates the routing tables, and sets scheduling (QoS) state.)
Router Architecture: Data Plane
Line cards
Input processing: can handle input links up to 2.4 Gbps
Output processing: uses a 52 MHz FPGA (Field Programmable Gate Array); implements QoS
Forwarding engine:
415 MHz DEC Alpha 21164 processor, with a three-level cache to store recent routes
Up to 12,000 routes in the second-level cache (96 kB); ~95% hit rate
Entire routing table in the tertiary cache (16 MB, divided into two banks)
Router Architecture: Control Plane
Network processor: 233 MHz Alpha 21064 running NetBSD 1.1
Updates routing
Manages link status
Implements reservations
Backplane allocator: implemented by an FPGA
The allocator is the heart of the high-speed switch
Schedules transfers between input/output interfaces
Control Plane: Backplane Allocator
Time is divided into epochs
16 ticks of the data clock (8 allocation clocks)
Transfer unit: 64 B (8 data clock ticks)
Up to 15 simultaneous transfers in an epoch
One transfer: 128 B of data + 176 auxiliary bits
Minimum of 4 epochs to schedule and complete a transfer, but scheduling is pipelined:
1. Source card signals that it has data to send to the destination card
2. Switch allocator schedules the transfer
3. Source and destination cards are notified and told to configure themselves
4. The transfer takes place
Flow control through inhibit pins
Early Crossbar Scheduling Algorithm
Wavefront algorithm
Evaluating the request matrix one position at a time is slow (36 "cycles" for a 6×6 matrix)
Observation: positions on the same anti-diagonal, e.g. (2,1) and (1,2), don't conflict with each other, so each diagonal can be evaluated in parallel (11 "cycles")
Do it in groups, with the groups in parallel (5 "cycles"); one can find the optimal group size, etc.
Problems: fairness, speed, …
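A sketch of the wavefront idea (the request matrix is made up for illustration): positions with equal i + j lie on the same anti-diagonal and never share a row or column, so each diagonal can be granted in parallel, 2N - 1 "cycles" instead of N²:

```python
# Wavefront allocation sketch: grant anti-diagonals of the request matrix
# in sequence; within a diagonal, all grants can happen in parallel since
# no two positions share a row (input) or column (output).
def wavefront(requests):
    n = len(requests)
    row_free = [True] * n   # input not yet matched
    col_free = [True] * n   # output not yet matched
    grants = []
    for diag in range(2 * n - 1):            # 2N - 1 "cycles"
        for i in range(n):                   # conceptually parallel
            j = diag - i
            if 0 <= j < n and requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = col_free[j] = False
    return grants

req = [[1, 1, 0],
       [1, 0, 1],
       [0, 1, 1]]
print(wavefront(req))   # [(0, 0), (1, 2), (2, 1)] -- a maximal matching
```

Note how the fixed scan order always favors position (0, 0); that is exactly the unfairness the next slide removes by shuffling sources and destinations.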
The Switch Allocator
Disadvantages of the simple allocator
Unfair: a preference for low-numbered sources
Requires evaluating 225 positions per epoch, which is too fast for an FPGA
Solution to the unfairness problem: random shuffling of sources and destinations
Solution to the timing problem: parallel evaluation of multiple locations
wavefront allocation requires 29 steps for a 15 × 15 matrix
the MGR allocator uses 3 × 5 groups
Priority to requests from forwarding engines over line cards, to avoid header contention on the line cards
Alternatives to the Wavefront Scheduler
PIM: Parallel Iterative Matching
Request: each input sends requests to all outputs for which it has packets
Grant: each output selects one of its requesting inputs at random and grants it
Accept: each input selects one of its received grants
Problem: the matching may not be maximal
Solution: run several iterations
Problem: the matching may not be "fair"
Solution: grant/accept in round-robin order instead of at random
iSLIP – Round-robin Parallel Iterative Matching (PIM)
Each input maintains a round-robin list of outputs
Each output maintains a round-robin list of inputs
Request phase: inputs send requests to all desired outputs
Grant phase: each output picks the first requesting input in its round-robin sequence
Accept phase: each input picks the first granting output in its round-robin sequence
An output updates its round-robin sequence only if its grant is accepted
Good fairness in simulation (see the sketch below)
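One request-grant-accept iteration of iSLIP, sketched in Python (pointer handling follows the bullets above; the request matrix is made up). Replacing the round-robin choices with random ones would give a PIM iteration instead:

```python
# One iSLIP iteration: request -> round-robin grant -> round-robin accept.
# Pointers advance only on accepted grants, which desynchronizes the
# outputs and is what gives iSLIP its good fairness.
def islip_iteration(requests, grant_ptr, accept_ptr):
    """requests[i][j] = 1 if input i has a cell for output j."""
    n = len(requests)
    grant = {}                                   # output -> granted input
    for j in range(n):                           # grant phase
        for k in range(n):
            i = (grant_ptr[j] + k) % n           # first requester at/after pointer
            if requests[i][j]:
                grant[j] = i
                break
    matches = []
    for i in range(n):                           # accept phase
        granted = [j for j, g in grant.items() if g == i]
        for k in range(n):
            j = (accept_ptr[i] + k) % n          # first grant at/after pointer
            if j in granted:
                matches.append((i, j))
                grant_ptr[j] = (i + 1) % n       # update RR state only on accept
                accept_ptr[i] = (j + 1) % n
                break
    return matches

req = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
print(islip_iteration(req, [0, 0, 0], [0, 0, 0]))  # [(0, 0), (2, 2)]
```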
100% Throughput?
Why does it matter?
Guaranteed behavior regardless of load!
The same reason we moved away from cache-based router architectures
Cool result:
Dai & Prabhakar: any maximal matching scheme in a crossbar with 2x speedup gets 100% throughput
Speedup: run the internal crossbar links at 2x the input & output link speeds
Summary: Design Decisions (Innovations)
1. Each FE has a complete set of routing tables
2. A switched fabric is used instead of the traditional shared bus
3. FEs are on boards distinct from the line cards
4. Use of an abstract link-layer header
5. QoS processing is included in the router
Outline
Circuit switching
Packet switching
Switch generations
Switch fabrics
Buffer placement
Schedule the fabric / crossbar / backplane
Routing lookup
Routing Lookup Problem / Port Mappers
Identify the output interface on which to forward an incoming packet, based on the packet's destination address
Forwarding tables (a local version of the routing table) summarize this information by maintaining a mapping between IP address prefixes and output interfaces
Route lookup finds the longest prefix in the table that matches the packet's destination address
Routing Lookups
Routing tables: 200,000 – 1M entries
A router must be able to handle the routing-table load 5 years hence. Maybe 10.
So, how to do it?
DRAM (Dynamic RAM, ~50 ns latency): cheap, slow
SRAM (Static RAM, <5 ns latency): fast, $$
TCAM (Ternary Content Addressable Memory – parallel lookups in hardware): really fast, quite $$, lots of power
Example
A packet with destination address 12.82.100.101 is sent to interface 2, as 12.82.100.xxx is the longest prefix matching the packet's destination address
Need to find the longest prefix match
A standard solution: tries
(Figure, forwarding table: 128.16.120.xxx -> 1; 12.82.xxx.xxx -> 3; 12.82.100.xxx -> 2. Lookups: 12.82.100.101 -> 2, 128.16.120.111 -> 1.)
Patricia Tries
Use binary tree paths to encode prefixes
Advantage: simple to implement
Disadvantage: one lookup may take O(m) steps, where m is the number of address bits (32 in IPv4)
(Example, m = 5: prefixes 001xx -> port 2, 0100x -> port 3, 10xxx -> port 1, 01100 -> port 5, encoded as paths in a binary trie; see the sketch after this list.)
Two ways to improve performance
cache recently used addresses
move common entries up to a higher level (match longer strings)
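A minimal binary-trie longest-prefix-match sketch, using the slide's m = 5 example table (prefixes and ports as in the figure):

```python
# Longest-prefix match with a binary trie: a lookup walks at most m bits,
# remembering the last node that carried a next-hop port.
class TrieNode:
    def __init__(self):
        self.child = [None, None]   # 0-branch, 1-branch
        self.port = None            # next hop, if a prefix ends here

def insert(root, prefix: str, port: int) -> None:
    node = root
    for bit in prefix:
        b = int(bit)
        if node.child[b] is None:
            node.child[b] = TrieNode()
        node = node.child[b]
    node.port = port

def lookup(root, addr: str):
    node, best = root, None
    for bit in addr:
        node = node.child[int(bit)]
        if node is None:
            break
        if node.port is not None:
            best = node.port        # longest match seen so far
    return best

root = TrieNode()
for prefix, port in [("001", 2), ("0100", 3), ("10", 1), ("01100", 5)]:
    insert(root, prefix, port)

print(lookup(root, "01100"))  # -> 5 (exact prefix 01100)
print(lookup(root, "00110"))  # -> 2 (longest matching prefix is 001xx)
```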
How can we speed "longest prefix match" up?
Two general approaches:
Shrink the table so it fits in really fast memory (cache)
Degermark et al.; optional reading
A complete prefix tree (every node has 2 or 0 children) compresses well; 3 stages:
• Match 16 bits; match the next 8; match the last 8
Drastically reduce the # of memory lookups
The WUSTL algorithm (binary search on prefixes) is from about the same time
Lulea’s Routing Lookup Algorithm
Small Forwarding Tables for Fast Routing Lookups (Sigcomm’97)
Minimize number of memory accesses
Minimize size of data structure (why?)
Solution: use a three-level data structure
First Level: Bit-Vector
Cover all prefixes down to depth 16
Use one bit to encode each prefix (root heads and genuine heads)
Memory requirements: 2^16 bits = 64 Kb = 8 KB
First Level: Pointers
Maintain 16-bit pointers
2 bits encode the pointer type
14 bits represent an index into the routing table or into an array containing level-two chunks
Pointers are stored at consecutive memory addresses
Problem: find the pointer
(Figure: addresses 0006abcd and 000acdef index into the bit vector 1000 1011 1000 1111; the set bits map to consecutive entries in the pointer array, which point into the routing table or to level-two chunks.)
Code Word and Base Indexes Array
Split the bit-vector into bit-masks (16 bits each)
How to find the corresponding bit-mask?
Maintain a 16-bit code word for each bit-mask (a 10-bit value; a 6-bit offset)
Maintain a base index array (one 16-bit entry for each 4 code words), holding the number of ones in the bit-vector before the group
(Figure: bit-vector, code word array, base index array.)
First Level: Finding Pointer Group
Use the first 12 bits of the address to index into the code word array
Use the first 10 bits to index into the base index array
(Example from the figure: for address 004C, the selected code word's offset plus the base index locate the pointer group: 13 + 0 = 13.)
First Level: Encoding Bit-masks
Observation: not all 16-bit values are possible
Example: bit-mask 1001… is not possible (why not?)
Let a(n) be the number of non-zero bit-masks of length 2^n
Compute a(n) using the recurrence:
a(0) = 1
a(n) = 1 + a(n-1)^2
For length 16 this gives 678 possible bit-mask values (677 non-zero, plus the zero mask)
These can be encoded in 10 bits: the r_i values in the code words
Store all possible bit-masks in a table, called maptable
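The recurrence is easy to check numerically; a two-line sketch:

```python
# Non-zero bit-masks of length 2^n in a complete prefix tree:
# a(0) = 1, a(n) = 1 + a(n-1)^2.
def a(n: int) -> int:
    return 1 if n == 0 else 1 + a(n - 1) ** 2

print(a(4))        # 677 non-zero bit-masks of length 16
print(a(4) + 1)    # 678 values including the all-zero mask; fits in 10 bits
```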
First Level: Finding Pointer Index
Each entry in maptable is a 4-bit offset:
the offset of the pointer within its group
Number of memory accesses: 3 (7 bytes accessed); see the sketch below
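Putting the first level together, a simplified sketch of the pointer-index computation; the field widths (12/10/4 bits; 10-bit value + 6-bit offset per code word) follow the slides, but the array layout here is illustrative, not the paper's exact encoding:

```python
# Lulea first level (simplified): the pointer index for the first 16 bits
# of a destination address is the count of ones in the bit-vector before
# that position, assembled from three small reads:
#   base index (per group of 4 code words)
# + 6-bit offset (ones in earlier bit-masks within the group)
# + maptable row (per bit-mask kind) at the bit's position in the mask.
def pointer_index(addr16: int, ten, six, base, maptable) -> int:
    cw  = addr16 >> 4     # top 12 bits: which 16-bit bit-mask / code word
    grp = cw >> 2         # top 10 bits: which base-index entry
    bit = addr16 & 0xF    # low 4 bits: position inside the bit-mask
    return base[grp] + six[cw] + maptable[ten[cw]][bit]
```

The three reads (code word, base index, maptable row) match the slide's count of 3 memory accesses.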
First Level: Memory Requirements
Code word array: one code word per bit-mask: 64 Kb
Base index array: one base index per four bit-masks: 16 Kb
Maptable: 677 × 16 entries, 4 bits each: ~43.3 Kb
Total: 123.3 Kb = 15.4 KB
First Level: Optimizations
Reduce the number of entries in maptable by two:
Don't store bit-masks 0 and 1; instead, encode their pointers directly in the code word
If the r value in a code word is larger than 676, it is a direct encoding
For direct encoding, use the r value + the 6-bit offset
Levels 2 and 3
Levels 2 and 3 consist of chunks
A chunk covers a subtree of height at most 8 (at most 256 heads)
Three types of chunks
Sparse: 1-8 heads: 8-bit indices, eight pointers (24 B)
Dense: 9-64 heads: like level 1, but only one base index (< 162 B)
Very dense: 65-256 heads: like level 1 (< 552 B)
Only 7 bytes are accessed to search each of levels 2 and 3
Notes
This data structure trades table-construction time for lookup time (build time < 100 ms)
A good trade-off, because routes are not supposed to change often
Lookup performance:
Worst case: 101 cycles
A 200 MHz (2 GHz) Pentium Pro can thus do at least 2 (20) million lookups per second
On average: ~50 cycles
Open question: how effective is this data structure in the case of IPv6?
Other challenges in Routing
Routers do more than just "longest prefix match" & crossbar scheduling
Packet classification (L3 and up!)
Counters & stats for measurement & debugging
IPSec and VPNs
QoS, packet shaping, policing
IPv6
Access control and filtering
IP multicast
AQM: RED, etc. (maybe)
Serious QoS: DiffServ (sometimes), IntServ (not)
Going Forward
Today's highest end: multi-rack routers
Measured in Tb/sec
One scenario: a big optical switch connecting multiple electrical switches
Cool design: McKeown SIGCOMM 2003 paper
BBN MGR: normal CPU for forwarding
Modern routers: several ASICs