Transcript Document
Internet Routers Case Study
Eric Keller
4/19/07 and 4/24/07
Outline
• Overview/Background
  – Landscape
  – Router components
  – RED/WRED
  – MPLS
• 5 example systems (2 Cisco, Juniper, Avici, Foundry)
• Software routers
2
Choices choices…
3
Interface Speeds
What I'll focus on:
• Most interesting architectures
• Lower end (I think) will mostly be software
• I'll talk about Click for that
4
Source:Chidamber Kulkarni
The US backbone
Core routers are supposed to be as fast as possible;
edge routers are supposed to have the features.
But core routers seemingly have all the same
functionality as edge routers, just faster (the line is blurring)
5
High Performance Switches and Routers, by
H. Jonathan Chao and Bin Liu
Internet Router Components
• 4 basic components common to 4 of the 5
systems studied (the other combined the first 3
cards):
  – Interface Cards
  – Packet Processing Cards
  – Switch Fabric Cards
  – Control Plane Cards
6
Data Path Functions
Ingress Line Card — Network Processor:
- Parse
- Identify flow
- Determine egress port
- Mark QoS parameters
- Append TM or SF header
Ingress Line Card — Ingress Traffic Manager:
- Police
- Manage congestion (WRED)
- Queue packets in class-based VOQs
- Segment packets into switch cells
Switch Fabric:
- Queue cells in class-based VOQs
- Flow control TM per class-based VOQ
- Schedule class-based VOQs to egress ports
Egress Line Card — Egress Traffic Manager:
- Reassemble cells into packets
- Shape outgoing traffic
- Schedule egress traffic
[Diagram: incoming packets → TM scheduler, WRED discard, class-based
queueing, segmentation + header → SF arbiter, SF flow control →
reassembly → egress scheduler & shaper]
7
Source: Vahid Tabatabaee
RED/WRED
• Tail Drop – drop packets when queues full or nearly full
– TCP global synchronization as all TCP connections "hold
back" simultaneously, and then step forward simultaneously
• RED – random early detection, uses probabilistic
dropping (details on next slide)
– Goal: mark packets at fairly evenly spaced intervals to avoid
global synchronization and avoid biases, and frequently
enough to keep average queue size down
• WRED – RED for multiple queues (each with different
probabilities)
8
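The RED dropping details referenced above reduce to two curves (shown on the later "RED graphs" slide): p_b rises linearly between the queue thresholds, and the actual marking probability p_a rises with the count of packets since the last mark. A minimal sketch, with made-up threshold values rather than any router's defaults:

```python
# Sketch of the RED marking decision (Floyd/Jacobson style), not any
# vendor's exact implementation. min_th, max_th, max_p are assumed values.
def red_drop_prob(avg, count, min_th=20, max_th=80, max_p=0.02):
    """Marking probability p_a, given average queue size `avg` and
    `count`, the number of packets since the last mark."""
    if avg < min_th:
        return 0.0          # below the low threshold: never mark
    if avg >= max_th:
        return 1.0          # above the high threshold: always mark
    # p_b grows linearly from 0 at min_th to max_p at max_th
    p_b = max_p * (avg - min_th) / (max_th - min_th)
    # p_a rises with count, so marks land at fairly even intervals
    return p_b / (1.0 - count * p_b) if count * p_b < 1.0 else 1.0
```

WRED is then just this decision applied per queue, with different threshold/probability parameters for each.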
RED/WRED
9
RED graphs
[Figure: left, p_b rises linearly from 0 at avg = min_th to max_p at
avg = max_th; right, p_a grows with count (packets since the last mark),
shown for p_b = 0.02 and p_b = 0.04.]
10
Multi-Protocol Label Switching
(MPLS)
• Emulates some properties of a circuit-switched
network over a packet-switched network
• 32-bit label stack entries used for forwarding instead of IP
address lookup (longest prefix matching)
  – Label swapped at each hop; popped at the end of the path
  – Has quality of service capabilities
11
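The 32-bit header mentioned above has a fixed layout (per RFC 3032): a 20-bit label, 3 traffic-class bits, a bottom-of-stack flag, and a TTL. A minimal decoder to illustrate:

```python
import struct

# Decode one 32-bit MPLS label stack entry (RFC 3032 layout).
def parse_mpls(entry_bytes):
    (word,) = struct.unpack("!I", entry_bytes)  # network byte order
    return {
        "label": word >> 12,          # 20 bits: which LSP the packet follows
        "tc":    (word >> 9) & 0x7,   # 3 bits: traffic class (QoS)
        "bos":   (word >> 8) & 0x1,   # 1 bit: last entry in the label stack
        "ttl":   word & 0xFF,         # 8 bits: time-to-live, as in IP
    }
```

A router forwards on the 20-bit label with an exact-match table lookup, which is what lets it skip the longest-prefix match on the IP address.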
MPLS
12
Internet Backbone Core Routers
• Cisco CRS-1 (2004)
• Cisco 12000 (prev generation)
• Juniper T-Series (2004)
• Avici TSR (2000)
• Foundry XMR (2006)
• (many failed companies)
13
Cisco CRS-1
• Cisco's top-end router for the Internet
backbone
• "Modular and distributed routing system"
• Scales up to 92 Tbps
• Supports OC768c/STM-256c (40 Gbps)
  – Fastest link the backbone carries today
• 100 Gbps ready
14
Models
• Each slot = 40 Gbps
• Some math: 4 × 40 Gbps = 160 Gbps,
but they say 320. Why?
• Fabric shelf – in a single-shelf config, all
switching is contained on cards in this system;
in a multi-shelf config, all switching is in its
own rack (fabric card shelf)
15
Recall 4 main components
• Interface Cards
• Packet Processing Cards
• Switch Fabric Cards
• Control Plane Cards
16
Cisco CRS-1 example 4 slot shelf
• Interface Cards – 4-port OC192c/STM-64c
• Packet Processing Cards – Multi Service Cards
• Switch Fabric Cards – on the back
• Control Plane Cards – Route Processor
17
Route Processor
• Performs control plane routing protocols (e.g. BGP)
  – Can control any line card on any shelf (recall: you can connect up to 72 shelves)
  – 1 redundant in each shelf
• One 1.2-GHz PowerPC or two 800-MHz PowerPCs, symmetric multiprocessing (SMP)
  – CPUs can only communicate through the switch fabric, as if they were on a separate card
• Connectivity
  – Console port (RJ-45 connector)
  – Auxiliary port (RJ-45 connector)
  – One 10/100/1000 Ethernet port (RJ-45 connector)
  – Two 10/100/1000 Ethernet ports for control plane connectivity
• Memory/storage
  – 4 GB of route memory per processor
  – 64 MB of boot Flash
  – 2 MB of nonvolatile RAM (NVRAM)
  – One 1-GB PCMCIA card (internal)
  – One 40-GB hard drive
18
Modular Service Card (MSC)
• The packet processing engine
• 1 for each interface module
• Connected via a midplane (built
into the chassis) to interface cards
and switch fabric cards
• Configurable with 2 GB of route
table memory (but the route
processor has 4 GB??)
• GB of packet buffer memory per
side (ingress/egress)
• Two SPPs – 188 Tensilica CPUs each
19
Silicon Packet Processor (SPP)
16 Clusters of 12 PPEs
20
From Eatherton ANCS05
Switching Fabric
• 3-stage, dynamically self-routed Benes topology
• Before more details, here's a picture of a Benes network
23
Switching Fabric
• 3-stage, dynamically self-routed Benes topology switching fabric
• Stage 1 (S1) — distributes traffic to Stage 2 of the fabric plane. Stage 1 elements receive
cells from the ingress MSC and distribute the cells to Stage 2 (S2) of the fabric plane.
  – Cells are distributed to S2 elements in round-robin fashion: one cell goes to the first S2 element,
the next cell goes to the next S2 element, and so on
• Stage 2 (S2) — performs switching; provides 2x speedup of cells (two output links for
every input link). Stage 2 elements receive cells from Stage 1 and route them toward the
appropriate:
  – egress MSC and PLIM (single-shelf system)
  – egress line card chassis (multishelf system)
• Stage 3 (S3) — performs switching, provides 2 times (2x) speedup of cells, and performs a
second level of the multicast function. Stage 3 elements receive cells from Stage 2 and
perform the switching necessary to route each cell to the appropriate egress MSC
• Buffering at both S2 and S3
• Uses backpressure, carried in the cell header
• Max 1152 ports?
24
Switch Fabric (some more info)
• 8 planes + 1 redundant
  – Cells sent round-robin between planes
• Supports multicast, up to 1 million groups
• Separate virtual channels/queues for different
priorities
• In a single-shelf system, fabric cards contain all 3
stages
• In a multi-shelf system, fabric cards contain only stage
2; line cards contain stages 1 & 3
25
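The round-robin striping of cells across planes can be sketched as a toy model. Sequence numbers here stand in for whatever reordering mechanism the hardware actually uses; the plane count is from the slide:

```python
# Stripe cells round-robin across 8 fabric planes (a 9th is redundant),
# then reassemble in order at egress. A simplified model, not CRS-1 logic.
def stripe(cells, n_planes=8):
    """Assign each cell to a plane in round-robin order."""
    planes = [[] for _ in range(n_planes)]
    for i, cell in enumerate(cells):
        planes[i % n_planes].append((i, cell))  # tag with a sequence number
    return planes

def reassemble(planes):
    """Merge the per-plane queues back into the original cell order."""
    tagged = [c for plane in planes for c in plane]
    return [cell for _, cell in sorted(tagged)]
```

The point of striping is that each plane only needs 1/8 of the total bandwidth, and losing one plane degrades capacity rather than connectivity.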
“XYZ selects Cisco CRS-1…”
• T-Com (division of Deutsche Telekom)
• KT, Korea's leading service provider
• SOFTBANK BB - for "Yahoo! BB" Super
Backbone
• Telstra – Australia
• Comcast
• China Telecom
• Free (Iliad Group) – Fiber to the home in France
• National LambdaRail
26
Cisco 12000 (GSR) series
Internal name: BFR
(What about the CRS-1?)
Chassis: 4-slot, 6-slot, 10-slot, 16-slot
Depending on model:
• 2.5 Gbps/slot
• 10 Gbps/slot
• 40 Gbps/slot
(so max 1.28 Tbps)
27
Switch Fabric
• Crossbar switch fabric
  – The 2.5 Gbps fabric has a 16 x 16 crossbar and uses the ESLIP algorithm for scheduling
  – The 10 Gbps fabric has a 64 x 64 crossbar and uses a multichannel matching algorithm for
scheduling
  – Not sure about 40 Gbps
• 64-byte cells are used within the switching fabric
  – 8-byte header, 48-byte payload and 8-byte CRC
  – It takes roughly 160 nanoseconds to transmit a cell
• Unicast and multicast data & routing protocol packets are transmitted over
the fabric
• Multicast packets are replicated within the fabric and transmitted to the
destination line cards by means of partial fulfillment (busy line cards are sent
copies later, when they are not busy)
• Local traffic on a line card still has to transit the fabric
  – e.g. a 40 Gbps slot could have 4 10-Gbps ports
28
http://cisco.cluepon.net
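With a 48-byte payload per 64-byte cell, the segmentation cost is easy to compute; a small sketch of just the arithmetic (the real header and CRC formats are not modeled):

```python
import math

CELL_PAYLOAD = 48  # bytes of packet data per 64-byte fabric cell

def cells_needed(packet_len):
    """Number of 64-byte fabric cells to carry a packet of `packet_len` bytes."""
    return math.ceil(packet_len / CELL_PAYLOAD)
```

So a 1500-byte packet needs 32 cells, i.e. 2048 bytes on the fabric — a 25% cell tax on top of the payload, which is one reason fabrics run with internal speedup.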
SCA - Scheduler Control ASIC
During each clock period (160 ns):
• Sending line cards send a fabric request to the SCA
• SCA runs the ESLIP scheduling algorithm
• SCA returns a fabric grant to the line card
• Line card responds with a fabric grant accept
• SCA sets the crossbar for that cell clock
• SCA listens for fabric backpressure to stop scheduling
for a particular line card
29
http://cisco.cluepon.net
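The request-grant-accept handshake above has the same shape as an iSLIP iteration. A simplified single-iteration sketch (ESLIP itself is Cisco-proprietary and adds multicast handling, omitted here; the pointer-update rule is also simplified):

```python
# One request-grant-accept iteration in the style of iSLIP.
# `requests[i]` is the set of outputs that input line card i has cells for.
def rga_iteration(requests, grant_ptr, accept_ptr, n):
    # Grant phase: each output grants the requesting input nearest its pointer.
    grants = {}
    for out in range(n):
        ins = [i for i in range(n) if out in requests[i]]
        if ins:
            grants[out] = min(ins, key=lambda i: (i - grant_ptr[out]) % n)
    # Accept phase: each input accepts the granting output nearest its pointer.
    by_input = {}
    for out, inp in grants.items():
        by_input.setdefault(inp, []).append(out)
    matches = []
    for inp, outs in by_input.items():
        out = min(outs, key=lambda o: (o - accept_ptr[inp]) % n)
        matches.append((inp, out))
        # Advance pointers past the match so schedules rotate (desynchronize).
        grant_ptr[out] = (inp + 1) % n
        accept_ptr[inp] = (out + 1) % n
    return matches
```

The crossbar is then set for exactly the matched (input, output) pairs for that cell clock.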
Juniper T-series
TX Matrix
-Connects up to 4 T640
-Total 2.56 Tbps
T640
-16 slots (40 Gbps each)
-OC768c
-Total 640 Gbps
T320
-8 slots (40 Gbps each)
-Total 320 Gbps
30
T640
Control Plane Card
Interface Cards
Packet Processing Cards
Switch Fabric Cards
31
Control Plane Card
• 1.6-GHz Pentium IV processor with integrated
256-KB Level 2 cache
• 2-GB DRAM
• 256-MB Compact flash drive for primary storage
• 30-GB IDE hard drive for secondary storage
• 10/100 Base-T auto-sensing RJ-45 Ethernet port
for out-of-band management
• Two RS-232 (DB9 connector) asynchronous serial
ports for console and remote management
32
Packet Processing Card
• L2/L3 Packet Processing ASICs remove Layer 2 packet headers,
segment incoming packets into 64 Byte data cells for internal
processing, reassemble data cells into L3 packets before transmission
on the egress network interface, and perform L2 egress packet
encapsulation.
• A T-Series Internet Processor ASIC performs forwarding table
lookups.
• Queuing and Memory Interface ASICs manage the buffering of data
cells in system memory and the queuing of egress packet notifications.
– Priority queue into switch
• Switch Interface ASICs manage the forwarding of data cells across the
T640 routing node switch fabric.
– Switch interface bandwidth “considerably higher” than network interface
33
Switch Fabric
• For a single-T640 configuration, uses a 16-port
crossbar (8 slots, each with 2 PFEs)
  – Request, grant
    • For flow control and fault detection
  – 4 parallel switch planes + 1 redundant plane
    • Cell-by-cell distribution among planes (round robin)
  – Sequence numbers and a reorder buffer at egress maintain packet
order
• Fair bandwidth allocation (e.g. for when multiple ingress ports
write to the same egress port)
• Graceful degradation (if 1 plane fails, just don't use it)
34
Switch Fabric
• For a multiple-T640 configuration, uses a Clos switch
(next slide)
  – The TX Matrix performs the middle stage
  – The 64x64 switch is built from the same 16x16
crossbars as the T640
  – 4 switching planes +
1 redundant plane
35
Clos networks
• 3-stage network (m, n, r)
  – m = number of middle-stage
switches
  – n = number of input ports on
input switches = number of output
ports on output switches
  – r = number of input/output
switches
• Strictly non-blocking for unicast
traffic iff
  – m >= 2n - 1
• Rearrangeably non-blocking iff
  – m >= n
What would you expect Juniper's to be?
36
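The two conditions can be checked directly. The numbers in the usage note below are illustrative (based on the slide's 16x16 crossbars), not Juniper's published (m, n, r):

```python
# Clos network non-blocking conditions, as stated on the slide.
def strictly_nonblocking(m, n):
    """True if a (m, n, r) Clos is strictly non-blocking for unicast."""
    return m >= 2 * n - 1

def rearrangeably_nonblocking(m, n):
    """True if a (m, n, r) Clos is rearrangeably non-blocking."""
    return m >= n
```

For example, with n = 16 input ports per stage-1 switch, m = 16 middle switches is rearrangeably but not strictly non-blocking; strict non-blocking would need m >= 31.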
Avici TSR
• Scales from 40 Gbps to 5
Tbps
• Each rack(14 racks max)
– 40 router module slots
– 4 route controller slots (no
details)
37
Multi Service Connect (MSC)
Line Cards
• Interface Ports
– Up to OC192c
• Packet Processing (lookup)
– Intel IXP 2400 network processor (next slide)
– Meant for 2.5 Gbps processing
• ASIC for QoS
• Switch Fabric
– Router node for the interconnect (in a couple slides)
Note: this is 3 of the 4 main
components on a single board
(which one is missing?)
38
Intel IXP2400
39
Interconnect
• Bill Dally must have had some input (author of a
white paper for Avici)
• Topology
– 3D Folded Torus 2x4x5 (40 nodes) single rack, 14x8x5
(560) maximal configuration
– 10 Gbps links
• Routing – source routing, random selection among
24 minimal paths (limited non-minimal supported)
• Flow Control – 2 virtual channels for each output
port (1120 max), each with their own buffers, one
for best-effort, and one for guaranteed rate traffic
40
Topology
Passive backplane
6x4x5 system (3 racks of 2x4x5)
On the right, each circle is 5 line cards (in the z direction); the backplane connects the 4 quadrants, jumpers connect adjacent backplanes, and loop-back
connectors (jumpers) are placed at edge machines.
* So each line represents 5 bidirectional channels (or 10 unidirectional)
* In a fully-expanded 14x8x5 (560 line card) system, one set of short cables is used to carry the y-dimension channels between two rows of
racks.
41
Bisection Bandwidth Scaling
• Claim: can upgrade the
3D torus 1 line card at
a time (compare to
crossbar, Clos, Benes)
• Claims a Benes can
only double (but the Cisco
CRS-1 scales to 1152
nodes)
[Figure: 2x2 x-y bisection constant as the z dimension is populated from 2x2x2 to 2x2x5;
4x5 y-z bisection constant as the x dimension is populated from 5x4x5 to 8x4x5;
8x5 y-z bisection constant as the x dimension is populated from 8x8x5 to 14x8x5]
42
High Path Diversity
• A 3D torus has many
minimal paths
  – 8x8x8 => 90 6-hop paths (for the average message, not the longest path)
• At least 2 are edge-disjoint
• Load balance across paths
  – Routing randomly selects among 24 of the paths
  – Compare the ability to, and need to, load balance for crossbar,
Clos?
43
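The 90-path figure follows from counting interleavings of per-dimension hops: assuming the average message in the 8x8x8 torus moves 2 hops in each of the three dimensions, the count is the multinomial coefficient:

```python
from math import factorial

# Minimal paths between torus nodes displaced (dx, dy, dz) hops:
# (dx+dy+dz)! / (dx! dy! dz!), since any interleaving of the
# per-dimension hops is a shortest path.
def minimal_paths(dx, dy, dz):
    return factorial(dx + dy + dz) // (
        factorial(dx) * factorial(dy) * factorial(dz))
```

With (2, 2, 2) this gives 6!/(2! 2! 2!) = 90, matching the slide; the router's source routing then picks randomly among 24 of them.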
Virtual Networks
• 2 virtual channels per output port (best-effort, guaranteed
bit rate – 33 us)
  – Max 1120 (14x5x8 torus with 2 per output)
  – Separate set of flit buffers at each channel for each virtual channel
• Acts as an output-queued crossbar
• Makes the torus non-blocking
• Shared physical links
  – Never loaded to more than 2/3, due to load balancing and speedup
  – 72-byte flits
  – Worst-case expected waiting time to access a link is 60 ns per hop
44
Foundry NetIron XMR
(cleverly named XMR4000, XMR8000, XMR16000, XMR32000)
• 4-, 8-, 16-, and 32-slot racks
• 40 Gbps per slot
(3 Tbps total capacity)
• Up to 10 GigE (can be
connected to SONET/SDH
networks, but no built-in
optical)
  – As of March 2007, they do
offer POS interfaces
• Highest single-rack
switching capacity
45
Architecture
46
Packet Processing
• Intel or AMCC network processor with offload
• NetLogic NL6000
  – IPv4/IPv6 multilayer packet/flow classification
  – Policy-based routing and policy enforcement (QoS)
  – Longest Prefix Match (CIDR)
  – Differentiated Services (DiffServ)
  – IP Security (IPSec)
  – Server Load Balancing
  – Transaction verification
47
Switch Fabric
• Clos with “data
striping” (same as
planes)
• Input queuing
– Multiple priority
queues for each output
– 256k virtual queues
• Output “pulls” data
• Supports Multicast
48
Forwarding Tables
Just to give some idea of sizes:
• NetIron XMR "industry leading scalability"
  – 10 million BGP routes and up to 500 BGP peers
  – 1 million IPv4 routes in hardware (FIB)
  – 240,000 IPv6 routes in hardware (FIB)
  – 2,000 BGP/MPLS VPNs and up to 1 million VPN routes
  – 16,000 VLLs/VPLSes and up to 1 million VPLS MAC
addresses
  – 4094 VLANs, and up to 2 million MAC addresses
49
Power Consumption
(again, just to give some idea)
50
Cisco + Juniper > 90%
• Some recent (past 5 years) failed
companies; I couldn't find any details on their
architectures:
  – Chiaro
  – Axiowave
  – Pluris Inc
  – Procket (assets bought by Cisco)
51
Software Architectures or
Bus based architectures
52
Click Modular Router
53
General idea
• Extensible toolkit for writing packet processors
• Architecture centered on elements
  – Small building blocks
  – Perform simple operations, e.g. decrement TTL
  – Written in C++
• Click routers
  – Directed graphs of elements
    • Comes with a library of ~300 elements; contributors have added many more
  – Text files
• Open source
  – Runs on Linux and BSD
54
From: Bart Braem, Michael Voorhaen
Click graph
• Elements connected by edges
– Output ports to input ports
• Describes possible packet flows
• FromDevice(eth0)
-> Counter
-> Discard;
55
From: Bart Braem, Michael Voorhaen
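The FromDevice -> Counter -> Discard graph above can be mimicked with a toy push pipeline in Python. Click elements are really C++; this only illustrates the push semantics (each element hands the packet to the next):

```python
# Toy analogue of the Click graph FromDevice(eth0) -> Counter -> Discard.
class Counter:
    def __init__(self, next_elem):
        self.count = 0
        self.next = next_elem
    def push(self, pkt):
        self.count += 1      # count the packet, then pass it downstream
        self.next.push(pkt)

class Discard:
    def push(self, pkt):
        pass                 # sink: drop the packet

sink = Discard()
ctr = Counter(sink)
for pkt in [b"p1", b"p2", b"p3"]:   # stand-in for FromDevice(eth0)
    ctr.push(pkt)
```

In real Click the same wiring is declared in a text file, and the scheduler (not a Python loop) drives packets through push and pull ports.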
Elements
• Class
  – element type (reuse!)
• Configuration string
  – initializes this instance
• Input port(s)
  – interface where packets arrive
  – triangles
• Output port(s)
  – interface where packets leave
  – squares
• Instances can be named
  – myTee :: Tee
56
From: Bart Braem, Michael Voorhaen
Push and pull: ports
• Push port
  – Filled square or triangle
  – Source initiates packet transfer
  – Event-based packet flow
• Pull port
  – Empty square or triangle
  – Destination initiates packet transfer
  – Used with polling, scheduling, …
• Agnostic port
  – Square-in-square or triangle-in-triangle
  – Becomes push or pull (inner square or triangle filled or empty)
57
From: Bart Braem, Michael Voorhaen
Push and pull: violations
• Push port
  – Has to be connected to a push or agnostic port
  – Conversion from push to pull
    • With a push-to-pull element
    • E.g. Queue
• Pull port
  – Has to be connected to a pull or agnostic port
  – Conversion from pull to push
    • With a pull-to-push element
    • E.g. Unqueue
58
From: Bart Braem, Michael Voorhaen
Compound elements
• Group elements in
larger elements
• Configuration with
variables
– Pass configuration to
the internal elements
– Can be anything
(constant, integer,
elements, IP address,
…)
– Motivates reuse
59
From: Bart Braem, Michael Voorhaen
Packets
• Packet consists of payload + annotations
  – Payload
    • char*
    • Access with struct*
  – Annotations (metadata to simplify processing)
    • "post-it"
    • IP header information
    • TCP header information
    • Paint annotations
    • User-defined annotations
60
From: Bart Braem, Michael Voorhaen
Click scripts
• Text files describing the Click graph
– Elements with their configurations
– Compound elements
– Connections
• src :: FromDevice(eth0);
ctr :: Counter;
sink :: Discard;
src -> ctr;
ctr -> sink;
• FromDevice(eth0)
-> Counter
-> Discard;
61
From: Bart Braem, Michael Voorhaen
Click scripts (cont)
• Input and output ports are identified by number (0, 1, …)
  – Input port: -> [nr1]Element
  – Output port: Element[nr2] ->
  – Both: -> [nr1]Element[nr2]
  – If there is only one port, the number can be omitted
• mypackets :: IPClassifier(dst host
$myaddr,-);
FromDevice(eth0)
-> mypackets;
mypackets[0]
-> Print(mine)
-> [0]Discard;
mypackets[1]
-> Print("the others")
-> Discard;
62
From: Bart Braem, Michael Voorhaen
Compound elements in Click
Scripts
• elementclass DumbRouter {
$myaddr |
mypackets :: IPClassifier(dst host $myaddr,-);
input[0] -> mypackets;
mypackets[0] -> [1]output;
mypackets[1] -> [0]output;
}
u :: DumbRouter(1.2.3.4);
FromDevice(eth0) -> u;
u[0] -> Discard;
u[1] -> ToDevice(eth0);
63
From: Bart Braem, Michael Voorhaen
Running Click
• Multiple possibilities
– Kernel module
• Completely overrides Linux routing
• High speed, requires root permissions
– Userlevel
• Runs as a daemon on a Linux system
• Easy to install and still fast
• Recommended
– nsclick
• Runs as a routing agent within the ns-2 network simulator
• Multiple routers on 1 system
• Difficult to install but less hardware needed
64
From: Bart Braem, Michael Voorhaen
Where Click Is Used
• MIT Roofnet (now Meraki Networks)
– Wireless mesh networks
• Mazu Networks
  – network monitoring
• Princeton’s VINI
• Software Defined Radio – Univ. of Colorado
• Implemented on NPU (by group at Berkeley),
FPGAs (by Xilinx and Colorado), multiprocessors
(MIT)
65
The End
66