
Switching and Router Design
October 2007
A generic switch
Classification

Packet vs. circuit switches

- Packets have headers; samples don't
- Connectionless vs. connection-oriented
  - Connection-oriented switches need a call setup
    - Setup is handled in the control plane by the switch controller
  - Connectionless switches deal with self-contained datagrams

                   Connectionless (router)   Connection-oriented (switching system)
  Packet switch    Internet router           ATM switching system
  Circuit switch   --                        Telephone switching system

Requirements

- Capacity of a switch is the maximum rate at which it can move information, assuming all data paths are simultaneously active
- Primary goal: maximize capacity
  - Subject to cost and reliability constraints
- A circuit switch must reject a call if it can't find a path for samples from input to output
  - Goal: minimize call blocking
- A packet switch must reject a packet if it can't find a buffer to store it while awaiting access to the output trunk
  - Goal: minimize packet loss
- Don't reorder packets

Outline

- Circuit switching
- Packet switching
  - Switch generations
  - Switch fabrics
  - Buffer placement (Architecture)
  - Schedule the fabric / crossbar / backplane
  - Routing lookup

Packet switching

- In a packet switch, for every packet, you must:
  - Do a routing lookup: decide where (which port) to send it
    - Datagram: lookup based on the entire destination address (packets carry a destination field)
    - Cell: lookup based on the VCI
  - Schedule the fabric / crossbar / backplane
  - Maybe buffer, maybe QoS, maybe filtering by ACLs
- Back-of-the-envelope numbers
  - Line cards can be 40 Gbit/sec today (OC-768)
  - To handle minimum-sized packets (~40 B): 125 Mpps, or 8 ns per packet
  - But note that this can be deeply pipelined, at the cost of buffering and complexity. Some lookup chips do this, though still with SRAM, not DRAM. Good lookup algorithms are still needed.
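
As a sanity check, the arithmetic behind these numbers in a few lines of Python (the line rate and packet size are the slide's assumed values):

```python
# Back-of-the-envelope: packets/sec and time budget per packet for a
# 40 Gbit/s line card receiving minimum-sized (~40 B) packets.
line_rate_bps = 40e9            # OC-768 line card
pkt_bits = 40 * 8               # minimum-sized packet

pps = line_rate_bps / pkt_bits
print(f"{pps / 1e6:.0f} Mpps")              # -> 125 Mpps
print(f"{1e9 / pps:.0f} ns per packet")     # -> 8 ns per packet
```
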
Router Architecture

- Control plane
  - How routing protocols establish routes, etc.
  - Port mappers
- Data plane
  - How packets get forwarded

Outline

- Circuit switching
- Packet switching
  - Switch generations
  - Switch fabrics
  - Buffer placement
  - Schedule the fabric / crossbar / backplane
  - Routing lookup

First generation switch - shared memory

- Line card DMAs the packet into buffer memory, CPU examines the header, output line card DMAs it out
- Bottleneck can be the CPU, the host adapter, or the I/O bus, depending on the design
- Most Ethernet switches and cheap packet routers
- Low-cost routers; speed: 300 Mbps - 1 Gbps

Example (first generation switch)

- Today a router built with a 1.33 GHz CPU - the bottleneck
  - Mean packet size 500 B
  - Word (4 B) memory access takes 15 ns
  - Bus ~ 100 MHz, memory ~ 5 ns
- 1) Interrupt takes 2.5 µs per packet
- 2) Per-packet processing time (200 instructions) = 0.15 µs
- 3) Copying the packet takes 500/4 * 33 ns = 4.1 µs
  - Copying one 4 B word = 4 instructions + 2 memory accesses = 33 ns
- Total time = 2.5 + 0.15 + 4.1 = 6.75 µs
  => speed is 500 * 8 bits / 6.75 µs ~ 600 Mbps
- Amortized interrupt cost is balanced by routing protocol cost
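
The slide's arithmetic can be reproduced directly (all constants are the slide's):

```python
cpu_hz = 1.33e9                             # CPU clock
interrupt = 2.5e-6                          # 1) per-packet interrupt cost
processing = 200 / cpu_hz                   # 2) 200 instructions ~ 0.15 us
word_copy = 4 / cpu_hz + 2 * 15e-9          # 4 instrs + 2 mem accesses ~ 33 ns
copy = (500 // 4) * word_copy               # 3) 125 words ~ 4.1 us

total = interrupt + processing + copy       # ~ 6.75 us per 500 B packet
print(f"{500 * 8 / total / 1e6:.0f} Mbps")  # -> ~590 Mbps (slide rounds to 600)
```
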
Second generation switch - shared bus

- Port mapping (forwarding decisions) done in the line cards
  - Direct transfer over the bus between line cards
  - If no route in the line card --> CPU ("slow" operations)
- Bottleneck is the bus
- Medium-end routers, switches, ATM switches
- Speed: 10 Gbps (> 8x a first-generation switch)

Third generation switches - point-to-point (switched) bus (fabric)

[Figure: line cards connected by a self-routing fabric; a tag is prepended to each packet to steer it]

Third generation (contd.)

- Bottleneck in the second generation switch is the bus
- Third generation switch provides parallel paths (fabric)
- Features
  - Self-routing fabric (+ tag)
  - Output buffer is a point of contention, unless we arbitrate access to the fabric
  - Potential for unlimited scaling, as long as we can resolve contention for the output buffer
- High-end routers, switches
- Speed: 1000 Gbps

Outline

- Circuit switching
- Packet switching
  - Switch generations
  - Switch fabrics
  - Buffer placement
  - Schedule the fabric / crossbar / backplane
  - Routing lookup

Buffered crossbar

- What happens if packets at two inputs both want to go to the same output?
- Can defer one at an input buffer
- Or, buffer at the crosspoints

Broadcast

- Packets are tagged with the output port #
- Each output matches tags
- Need to match N addresses in parallel at each output
- Useful only for small switches, or as a stage in a large switch

Switch fabric element

- Can build complicated fabrics from a simple element
- Self-routing rule: if tag = 0, send packet to the upper output, else to the lower output
- If both packets go to the same output, buffer or drop

Fabrics built with switching elements

- An NxN switch built from bxb elements has log_b N stages with N/b elements per stage
  - Ex: an 8x8 switch built from 2x2 elements has 3 stages with 4 elements per stage
- Fabric is self-routing (a sketch follows below)
- Recursive
- Can be synchronous or asynchronous
- Regular and suitable for VLSI implementation
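
To illustrate self-routing with 2x2 elements, here is a sketch of the per-stage decision rule (names are my own; the fabric's internal wiring is not modeled):

```python
# Trace one cell through a log2(N)-stage fabric of 2x2 elements: at each
# stage, the current tag bit picks the upper (0) or lower (1) output.
# With tag = destination port in binary, MSB first, the cell reaches the
# right port regardless of the input it entered on.

def self_route(dest_port: int, n_ports: int) -> list[str]:
    stages = n_ports.bit_length() - 1        # log2(N) stages
    path = []
    for s in range(stages):
        bit = (dest_port >> (stages - 1 - s)) & 1
        path.append(f"stage {s}: tag bit {bit} -> "
                    f"{'upper' if bit == 0 else 'lower'} output")
    return path

for step in self_route(dest_port=5, n_ports=8):  # tag 101 in an 8x8 fabric
    print(step)
```
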
The fabric for a Batcher-banyan switch

An example using a Batcher-banyan switch

Outline

- Circuit switching
- Packet switching
  - Switch generations
  - Switch fabrics
  - Buffer placement (Architecture)
  - Schedule the fabric / crossbar / backplane
  - Routing lookup

Generic Router Architecture

Speedup

- C: input/output link capacity
- R_I: maximum rate at which an input interface can send data into the backplane
- R_O: maximum rate at which an output interface can read data from the backplane
- B: maximum aggregate backplane transfer rate
- Backplane speedup: B/C
- Input speedup: R_I/C
- Output speedup: R_O/C

[Figure: input interfaces at rate R_I feeding the interconnection medium (backplane, rate B); output interfaces reading at rate R_O; external links at rate C]

Buffering - three router architectures

- Where should we place buffers?
  - Output queued (OQ)
  - Input queued (IQ)
  - Combined input-output queued (CIOQ)
- Function division
  - Input interfaces:
    - Must perform packet forwarding - need to know to which output interface to send packets
    - May enqueue packets and perform scheduling
  - Output interfaces:
    - May enqueue packets and perform scheduling

Output Queued (OQ) Routers

- Only output interfaces store packets
- Advantages
  - Easy to design algorithms: only one congestion point
- Disadvantages
  - Requires an output speedup of N, where N is the number of interfaces
  - Not feasible

Input Queueing (IQ) Routers

- Only input interfaces store packets
- Advantages
  - Easy to build: store packets at inputs if there is contention at the outputs
  - Relatively easy to design algorithms: only one congestion point, but not at the output...
    - Need to implement backpressure
- Disadvantages
  - In general, hard to achieve high utilization
  - However, theoretical and simulation results show that for realistic traffic an input/output speedup of 2 is enough to achieve utilization close to 1

Combined Input-Output Queueing (CIOQ) Routers

- Both input and output interfaces store packets
- Advantages
  - Easy to build
  - Utilization 1 can be achieved with input/output speedup (<= 2)
- Disadvantages
  - Harder to design algorithms
  - Two congestion points
  - Need to design flow control
- With an input/output speedup of 2, a CIOQ router can emulate any work-conserving OQ router [G+98, SZ98]

Generic Architecture of a High-Speed Router Today

- CIOQ - combined input-output queued architecture
  - Input/output speedup <= 2
- Input interface
  - Performs packet forwarding (and classification)
- Output interface
  - Performs packet (classification and) scheduling
- Backplane / fabric
  - Point-to-point (switched) bus; speedup N
  - Schedules packet transfer from input to output

Outline

- Circuit switching
- Packet switching
  - Switch generations
  - Switch fabrics
  - Buffer placement
  - Schedule the fabric / crossbar / backplane
  - Routing lookup

Backplane / Fabric / Crossbar

- A point-to-point switch allows simultaneous packet transfers between any disjoint pairs of input-output interfaces
- Goal: come up with a schedule that
  - Maximizes router throughput
  - Meets flow QoS requirements
- Challenges:
  - Address head-of-line blocking at inputs
  - Resolve input/output contention
  - Avoid packet dropping at the output if possible
- Note: packets are fragmented into fixed-size cells (why?) at the inputs and reassembled at the outputs
  - In Partridge et al., a cell is 64 B (what are the trade-offs?)

Head-of-line Blocking

- The cell at the head of an input queue cannot be transferred, thus blocking the cells behind it

[Figure: 3x3 example with inputs 1-3 and outputs 1-3. One HOL cell cannot be transferred because it is blocked by another input's (red) cell bound for the same output; another cannot be transferred because its output buffer is full]

To Avoid Head-of-line Blocking

- Head-of-line blocking with only 1 queue per input
  - Max throughput <= (2 - sqrt(2)) =~ 58% (simulated below)
- Solution? Maintain N virtual queues at each input, i.e., one per output (virtual output queues)
  - Requires N queues; more if QoS
  - This is the way it's done now

[Figure: each of the 3 inputs holds a separate virtual queue per output]
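
The 58% figure is easy to reproduce; a minimal simulation sketch, assuming saturated inputs, uniform random destinations, and random tie-breaking at each output:

```python
import random
from collections import deque

def hol_throughput(n=32, slots=20000):
    # One FIFO per input; only the head-of-line (HOL) cell can contend.
    queues = [deque(random.randrange(n) for _ in range(2)) for _ in range(n)]
    served = 0
    for _ in range(slots):
        contenders = {}
        for i, q in enumerate(queues):
            contenders.setdefault(q[0], []).append(i)   # contend for HOL dest
        for out, inputs in contenders.items():
            w = random.choice(inputs)                   # one winner per output
            queues[w].popleft()
            queues[w].append(random.randrange(n))       # keep input saturated
            served += 1
        # losing HOL cells stay put and block the cells behind them
    return served / (slots * n)

print(f"throughput ~ {hol_throughput():.2f}")   # ~ 0.59, vs. 2 - sqrt(2) = 0.586
```
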
Cell transfer

- Schedule: ideally, find the maximum number of input-output pairs such that:
  - Input/output contention is resolved
  - Packet drops at the outputs are avoided
  - Packets meet their time constraints (e.g., deadlines), if any
- Example:
  - Use stable matching
  - Try to emulate an OQ switch

Stable Marriage Problem

- Consider N women and N men
- Each woman/man ranks each man/woman in the order of their preferences
- Stable matching: a matching with no blocking pairs
- Blocking pair: let p(i) denote the partner of i
  - There are matched pairs (k, p(k)) and (j, p(j)) such that k prefers p(j) to p(k), and p(j) prefers k to j

Gale-Shapley Algorithm (GSA)

- As long as there is a free man m:
  - m proposes to the highest-ranked woman w in his list to whom he hasn't proposed yet
  - If w is free, m and w are engaged
  - If w is engaged to m' and w prefers m to m', w releases m' and becomes engaged to m
  - Otherwise m remains free
- A stable matching exists for every set of preference lists
- Complexity: worst case O(N^2)
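
A compact sketch of GSA (the data layout and names are my own):

```python
def gale_shapley(men_pref, women_pref):
    """men_pref[m] / women_pref[w]: partners listed in preference order."""
    n = len(men_pref)
    rank = [{m: r for r, m in enumerate(p)} for p in women_pref]
    next_w = [0] * n               # next index in m's list to propose to
    fiance = [None] * n            # fiance[w] = man w is engaged to
    free = list(range(n))
    while free:
        m = free.pop()
        w = men_pref[m][next_w[m]] # highest-ranked not yet proposed to
        next_w[m] += 1
        if fiance[w] is None:
            fiance[w] = m                       # w was free: engage
        elif rank[w][m] < rank[w][fiance[w]]:
            free.append(fiance[w])              # w releases m'
            fiance[w] = m
        else:
            free.append(m)                      # m remains free
    return fiance                  # stable; worst case O(N^2) proposals

print(gale_shapley([[0, 1], [0, 1]], [[1, 0], [0, 1]]))   # -> [1, 0]
```
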
OQ Emulation with a Speedup of 2

- Each input and output maintains a preference list
  - Input preference list: the cells at that input, ordered in the inverse order of their arrival
  - Output preference list: all input cells to be forwarded to that output, ordered by the times they would be served in an output-queueing schedule
- Use GSA to match inputs to outputs
  - Outputs initiate the matching
- Can emulate all work-conserving schedulers

Example with a Speedup of 2

[Figure, panels (a)-(d): a worked matching example; three inputs holding cells a.1-c.3 are matched to outputs 1-3 phase by phase, with the matching after step 1 shown in panel (d)]

A Case Study [Partridge et al '98]

- Goal: show that routers can keep pace with improvements in transmission link bandwidth
- Architecture: a CIOQ router
  - 15 (input/output) line cards: C = 2.4 Gbps (3.3 Gbps including packet headers)
  - Each input card can handle up to 16 (input/output) interfaces
  - Separate forwarding engines (FEs) perform routing
  - Backplane: point-to-point (switched) bus, capacity B = 50 Gbps (32 Mpps)
  - B/C = 50/2.4 ~ 20

Router Architecture

[Figure: a packet arrives; its header is passed to a forwarding engine while the packet waits on the line card]

[Figure: 15 line cards (data in / data out) attached to the backplane, together with the forwarding engines and a network processor; the network processor receives control data (e.g., routing), updates the routing tables, and sets scheduling (QoS) state]

Router Architecture: Data Plane

- Line cards
  - Input processing: can handle input links up to 2.4 Gbps
  - Output processing: uses a 52 MHz FPGA (Field-Programmable Gate Array); implements QoS
- Forwarding engine:
  - 415 MHz DEC Alpha 21164 processor, with a three-level cache to store recent routes
  - Up to 12,000 routes in the second-level cache (96 kB); ~95% hit rate
  - Entire routing table in the tertiary cache (16 MB divided into two banks)

Router Architecture: Control Plane

- Network processor: 233 MHz Alpha 21064 running NetBSD 1.1
  - Updates routing tables
  - Manages link status
  - Implements reservations
- Backplane allocator: implemented in an FPGA
  - The allocator is the heart of the high-speed switch
  - Schedules transfers between input/output interfaces

Control Plane: Backplane Allocator

- Time is divided into epochs
  - 16 ticks of the data clock (8 allocation clocks)
- Transfer unit: 64 B (8 data clock ticks)
- Up to 15 simultaneous transfers in an epoch
  - One transfer: 128 B of data + 176 auxiliary bits
- Minimum of 4 epochs to schedule and complete a transfer, but scheduling is pipelined:
  1. Source card signals that it has data to send to the destination card
  2. Switch allocator schedules the transfer
  3. Source and destination cards are notified and told to configure themselves
  4. Transfer takes place
- Flow control through inhibit pins

Early Crossbar Scheduling Algorithm

- Wavefront algorithm
- Naive sequential evaluation of the request matrix is slow (36 "cycles" for a 6x6 example)
- Observation: requests (2,1) and (1,2) don't conflict with each other, so each anti-diagonal wavefront can be evaluated in parallel (11 "cycles")
- Do it in groups, with the groups in parallel (5 "cycles")
- Can find the optimal group size, etc.
- Problems: fairness, speed, ...
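
A sketch of the wavefront idea (my own illustration): cells on the same anti-diagonal of the request matrix use distinct inputs and outputs, so each wavefront can be evaluated in parallel, giving 2N-1 waves instead of N^2 sequential positions:

```python
def wavefront(req):              # req[i][j] = 1 if input i requests output j
    n = len(req)
    in_free, out_free = [True] * n, [True] * n
    grants = []
    for wave in range(2 * n - 1):                     # anti-diagonal i + j == wave
        for i in range(max(0, wave - n + 1), min(n, wave + 1)):
            j = wave - i
            if req[i][j] and in_free[i] and out_free[j]:
                in_free[i] = out_free[j] = False      # claim input and output
                grants.append((i, j))
    return grants                                     # a maximal matching

print(wavefront([[1, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]]))    # -> [(0, 0), (1, 2), (2, 1)]
```

Note the built-in bias: position (0,0) is always considered first, which is exactly the unfairness discussed on the next slide.
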
The Switch Allocator

- Disadvantages of the simple allocator
  - Unfair: a preference for low-numbered sources
  - Requires evaluating 225 positions per epoch, which is too fast for an FPGA
- Solution to the unfairness problem: random shuffling of sources and destinations
- Solution to the timing problem: parallel evaluation of multiple locations
  - Wavefront allocation requires 29 steps for a 15x15 matrix
  - The MGR allocator uses 3x5 groups
- Priority is given to requests from forwarding engines over line cards, to avoid header contention on the line cards

Alternatives to the Wavefront Scheduler

- PIM: Parallel Iterative Matching
  - Request: each input sends requests to all outputs for which it has packets
  - Grant: each output selects one of its requesting inputs at random and grants it
  - Accept: each input selects one of its received grants
- Problem: the matching may not be maximal
  - Solution: run several iterations
- Problem: the matching may not be "fair"
  - Solution: grant/accept in round-robin order instead of at random
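
A sketch of one PIM iteration (names are mine; real implementations run the three phases in parallel in hardware):

```python
import random

def pim_iteration(requests, matched_in, matched_out):
    """requests[i]: set of outputs input i has cells for.
    matched_in: dict input -> output; matched_out: set of taken outputs."""
    n = len(requests)
    grants = {}                            # grants[input] = granting outputs
    for out in range(n):                   # grant phase (random choice)
        if out in matched_out:
            continue
        asking = [i for i in range(n)
                  if out in requests[i] and i not in matched_in]
        if asking:
            grants.setdefault(random.choice(asking), []).append(out)
    for i, outs in grants.items():         # accept phase (random choice)
        out = random.choice(outs)
        matched_in[i] = out
        matched_out.add(out)
    return matched_in, matched_out

m_in, m_out = {}, set()
for _ in range(3):                         # a few iterations -> near-maximal
    pim_iteration([{0, 1}, {0, 2}, {1}], m_in, m_out)
print(m_in)
```
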
iSLIP – Round-robin Parallel Iterative Matching (PIM)

- Each input maintains a round-robin list of outputs
- Each output maintains a round-robin list of inputs
- Request phase: inputs send requests to all desired outputs
- Grant phase: each output picks the first requesting input in its round-robin sequence
  - Each input then accepts the first output in its own RR sequence
  - An output updates its RR sequence only if its grant is accepted
- Good fairness in simulation
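
A single-iteration sketch of iSLIP (a simplification of mine): the round-robin pointers replace PIM's random choices, and an output advances its pointer only when its grant is accepted, which desynchronizes the pointers over time:

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """requests[i]: set of outputs input i wants; *_ptr: RR pointers."""
    n = len(requests)
    grants = {}                                   # grants[input] = [outputs]
    for out in range(n):                          # grant phase
        for k in range(n):
            i = (grant_ptr[out] + k) % n          # first requester at/after ptr
            if out in requests[i]:
                grants.setdefault(i, []).append(out)
                break
    match = []
    for i, outs in grants.items():                # accept phase
        for k in range(n):
            out = (accept_ptr[i] + k) % n         # first grant at/after ptr
            if out in outs:
                match.append((i, out))
                accept_ptr[i] = (out + 1) % n
                grant_ptr[out] = (i + 1) % n      # advanced only on accept
                break
    return match

print(islip_iteration([{0, 1}, {0}, {1, 2}], [0, 0, 0], [0, 0, 0]))
```
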
100% Throughput?

- Why does it matter?
  - Guaranteed behavior regardless of load!
  - Same reason we moved away from cache-based router architectures
- Cool result:
  - Dai & Prabhakar: any maximal matching scheme in a crossbar with 2x speedup gets 100% throughput
  - Speedup: run the internal crossbar links at 2x the input & output link speeds

Summary: Design Decisions (Innovations)

1. Each FE has a complete set of routing tables
2. A switched fabric is used instead of the traditional shared bus
3. FEs are on boards distinct from the line cards
4. Use of an abstract link-layer header
5. Include QoS processing in the router

Outline

- Circuit switching
- Packet switching
  - Switch generations
  - Switch fabrics
  - Buffer placement
  - Schedule the fabric / crossbar / backplane
  - Routing lookup

Routing Lookup Problem / Port Mappers

- Identify the output interface on which to forward an incoming packet, based on the packet's destination address
- Forwarding tables (a local version of the routing table) summarize information by maintaining a mapping between IP address prefixes and output interfaces
- Route lookup: find the longest prefix in the table that matches the packet's destination address

Routing Lookups

- Routing tables: 200,000 - 1M entries
  - The router must be able to handle the routing table load 5 years hence. Maybe 10.
- So, how to do it?
  - DRAM (Dynamic RAM, ~50 ns latency): cheap, slow
  - SRAM (Static RAM, <5 ns latency): fast, $$
  - TCAM (Ternary Content-Addressable Memory - parallel lookups in hardware): really fast, quite $$, lots of power

Example

- A packet with destination address 12.82.100.101 is sent to interface 2, as 12.82.100.xxx is the longest prefix matching the packet's destination address
- Need to find the longest prefix match
- A standard solution: tries

  Prefix            Interface
  128.16.120.xxx    1
  12.82.xxx.xxx     3
  12.82.100.xxx     2
  ...               ...

  Lookups: 12.82.100.101 -> 2;  128.16.120.111 -> 1

Patricia Tries

- Use binary tree paths to encode prefixes
- Advantage: simple to implement
- Disadvantage: one lookup may take O(m), where m is the number of bits (32 in IPv4)

[Figure: example trie with m = 5; prefixes 001xx -> 2, 0100x -> 3, 10xxx -> 1, 01100 -> 5]

- Two ways to improve performance
  - Cache recently used addresses
  - Move common entries up to a higher level (match longer strings)
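
For concreteness, a minimal binary trie doing longest-prefix match (a sketch: one node per bit, without Patricia's path compression, so a lookup is O(m)); it uses the example prefixes from the figure above:

```python
class TrieNode:
    def __init__(self):
        self.child = [None, None]
        self.port = None                  # set if a prefix ends here

def insert(root, prefix_bits, port):
    node = root
    for b in prefix_bits:                 # e.g. "001" for the /3 prefix 001xx
        i = int(b)
        node.child[i] = node.child[i] or TrieNode()
        node = node.child[i]
    node.port = port

def lookup(root, addr_bits):
    node, best = root, None
    for b in addr_bits:                   # remember the deepest match seen
        node = node.child[int(b)]
        if node is None:
            break
        if node.port is not None:
            best = node.port
    return best

root = TrieNode()
insert(root, "001", 2); insert(root, "0100", 3)
insert(root, "10", 1);  insert(root, "01100", 5)
print(lookup(root, "01001"))              # -> 3 (longest matching prefix 0100x)
```
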
How can we speed "longest prefix match" up?

- Two general approaches:
  - Shrink the table so it fits in really fast memory (cache)
    - Degermark et al.; optional reading
    - A complete prefix tree (each node has 2 or 0 children) can be compressed well. 3 stages: match 16 bits; match the next 8; match the last 8
  - Drastically reduce the # of memory lookups
    - The WUSTL algorithm is from ca. the same time (binary search on prefixes)

Lulea's Routing Lookup Algorithm
Small Forwarding Tables for Fast Routing Lookups (SIGCOMM '97)

- Minimize the number of memory accesses
- Minimize the size of the data structure (why?)
- Solution: use a three-level data structure

First Level: Bit-Vector

- Cover all prefixes down to depth 16
- Use one bit to encode each prefix
  - Memory requirements: 2^16 bits = 64 Kb = 8 KB

[Figure: the prefix tree cut at depth 16; bits mark root heads and genuine heads]

First Level: Pointers

- Maintain 16-bit pointers
  - 2 bits encode the pointer type
  - 14 bits represent an index into the routing table or into an array containing level-two chunks
- Pointers are stored at consecutive memory addresses
- Problem: find the pointer

[Figure: bit vector 1 0 0 0 1 0 1 1 1 0 0 0 1 1 1 1; each set bit owns one entry in the pointer array, pointing either into the routing table or to a level-two chunk; example addresses 0006abcd and 000acdef]

Code Word and Base Index Array

- Split the bit-vector into bit-masks (16 bits each)
- Find the corresponding bit-mask. How?
  - Maintain a 16-bit code word for each bit-mask (10-bit value; 6-bit offset)
  - Maintain a base index array (one 16-bit entry for each 4 code words)
- The offset and base index count the number of previous ones in the bit-vector

[Figure: the bit-vector split into 16-bit masks, with the code word array and base index array alongside]

First Level: Finding the Pointer Group

- Use the first 12 bits to index into the code word array
- Use the first 10 bits to index into the base index array

[Figure: address 0x004C; its first 12 bits (= 4) index the code word array and its first 10 bits (= 1) index the base index array; pointer group index = code word offset + base index = 13 + 0 = 13]

First Level: Encoding Bit-masks

- Observation: not all 16-bit values are possible
  - Example: bit-mask 1001... is not possible (why not?)
- Let a(n) be the number of non-zero bit-masks of length 2^n
- Compute a(n) using the recurrence (checked below):
  - a(0) = 1
  - a(n) = 1 + a(n-1)^2
- For length 16, there are 678 possible values for bit-masks
- This can be encoded in 10 bits
  - The values r_i in the code words
- Store all possible bit-masks in a table, called maptable
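
The recurrence is easy to check:

```python
# a(0) = 1; a(n) = 1 + a(n-1)^2 counts non-zero bit-masks of length 2^n.
a = 1
for n in range(1, 5):
    a = 1 + a * a                 # a(1)=2, a(2)=5, a(3)=26, a(4)=677
print(a)   # 677 non-zero masks of length 16 (678 values with the zero mask)
```
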
First Level: Finding the Pointer Index

- Each entry in maptable is a 4-bit offset:
  - The offset of the pointer within its group
- Number of memory accesses: 3 (7 bytes accessed)
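
Putting the first level together, a simplified software sketch (array names are mine, and it assumes the addressed bit is a set head bit; the real structure packs the 6-bit offset and 10-bit maptable row into one 16-bit code word, and replaces the popcount below with a maptable lookup):

```python
def build_level1(bit_vector):          # bit_vector: list of 2**16 0/1 bits
    masks = [bit_vector[i*16:(i+1)*16] for i in range(4096)]
    offsets, base_index = [], []
    total_ones = 0
    for i, m in enumerate(masks):
        if i % 4 == 0:
            base_index.append(total_ones)   # one entry per 4 code words
            group_ones = 0
        offsets.append(group_ones)          # the code word's 6-bit offset
        group_ones += sum(m)
        total_ones += sum(m)
    return masks, offsets, base_index

def pointer_index(addr16, masks, offsets, base_index):
    mask_ix = addr16 >> 4         # first 12 bits -> code word / bit-mask
    base_ix = addr16 >> 6         # first 10 bits -> base index entry
    bit = addr16 & 0xF            # low 4 bits -> position inside the mask
    within = sum(masks[mask_ix][:bit + 1]) - 1  # maptable's job in hardware
    return base_index[base_ix] + offsets[mask_ix] + within
```
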

First Level: Memory Requirements

- Code word array: one code word per bit-mask
  - 64 Kb
- Base index array: one base index per four bit-masks
  - 16 Kb
- Maptable: 677 x 16 entries, 4 bits each
  - ~43.3 Kb
- Total: 123.3 Kb = 15.4 KB

First Level: Optimizations

- Reduce the number of maptable entries by two:
  - Don't store bit-masks 0 and 1; instead encode the pointers directly into the code word
  - If the r value in a code word is larger than 676 -> direct encoding
  - For direct encoding, use the r value + the 6-bit offset

Levels 2 and 3

- Levels 2 and 3 consist of chunks
- A chunk covers a sub-tree of height 8 -> at most 256 heads
- Three types of chunks
  - Sparse: 1-8 heads
    - 8-bit indices, eight pointers (24 B)
  - Dense: 9-64 heads
    - Like level 1, but only one base index (< 162 B)
  - Very dense: 65-256 heads
    - Like level 1 (< 552 B)
- Only 7 bytes are accessed to search each of levels 2 and 3

Notes

- This data structure trades table construction time for lookup time (build time < 100 ms)
  - A good trade-off, because routes are not supposed to change often
- Lookup performance:
  - Worst case: 101 cycles
  - A 200 MHz (2 GHz) Pentium Pro can do at least 2 (20) million lookups per second
  - On average: ~50 cycles
- Open question: how effective is this data structure in the case of IPv6?

Other challenges in Routing

- Routers do more than just "longest prefix match" & crossbar scheduling
  - Packet classification (L3 and up!)
  - Counters & stats for measurement & debugging
  - IPsec and VPNs
  - QoS, packet shaping, policing
  - IPv6
  - Access control and filtering
  - IP multicast
  - AQM: RED, etc. (maybe)
  - Serious QoS: DiffServ (sometimes), IntServ (not)

Going Forward

- Today's highest end: multi-rack routers
  - Measured in Tb/sec
  - One scenario: a big optical switch connecting multiple electrical switches
  - Cool design: McKeown SIGCOMM 2003 paper
- BBN MGR: normal CPU for forwarding
  - Modern routers: several ASICs