
Routing Lookups and Packet Classification:
Theory and Practice

Hot Interconnects 8 Tutorial, August 18, 2000

Pankaj Gupta
Department of Computer Science
Stanford University
[email protected]
http://www.stanford.edu/~pankaj
Tutorial Outline
• Introduction
– What this tutorial is about
• Routing lookups
– Background, lookup schemes
• Packet Classification
– Background, classification schemes
• Implementation choices for given
design requirements
2
Request to you
• Please ask lots of questions!
– But I may not be able to answer all of
them right now
• I am here to learn, so please share
your experiences, thoughts and
opinions freely
3
What is this tutorial about?
4
Internet: Mesh of Routers

[Figure: the Internet core as a mesh of routers, with edge routers connecting campus area networks.]
5
RFC 1812: Requirements for
IPv4 Routers
• Must perform an IP datagram forwarding
decision (called forwarding)
• Must send the datagram out the
appropriate interface (called switching)
Optionally: a router MAY choose to perform special
processing on incoming packets
6
Examples of special
processing
• Filtering packets for security reasons
• Delivering packets according to a pre-agreed delay guarantee
• Treating high priority packets
preferentially
• Maintaining statistics on the number
of packets sent by various routers
7
Special Processing Requires
Identification of Flows
• All packets of a flow obey a pre-defined
rule and are processed similarly by the
router
• E.g. a flow = (src-IP-address, dst-IP-address), or a flow = (dst-IP-prefix, protocol) etc.
• Router needs to identify the flow of every
incoming packet and then perform
appropriate special processing
8
Flow-aware vs Flow-unaware Routers

• Flow-aware router: keeps track of flows and performs similar processing on packets in a flow
• Flow-unaware router (packet-by-packet router): treats each incoming packet individually
9
What this tutorial is about:
• Algorithms and techniques that an IP
router uses to decide where to
forward the packets next (routing
lookup)
• Algorithms and techniques that a
flow-aware router uses to classify
packets into flows (packet
classification)
10
Routing Lookups
11
Routing Lookups: Outline
• Background and problem
definition
• Lookup schemes
• Comparative evaluation
12
Lookup in an IP Router

[Figure: the forwarding engine extracts the destination address from the header of an incoming packet and performs a next-hop computation against the forwarding table, which maps destination prefixes to next hops.]

Unicast destination address based lookup
13
Packet-by-packet Router

[Figure: linecards, each holding a forwarding table and forwarding-decision logic, connected through an interconnect; a routing processor maintains the forwarding tables.]
14
Packet-by-packet Router: Basic Architectural Components

Control: routing
Datapath (per-packet processing): routing lookup, switching, scheduling
15
ATM and MPLS Switches

[Figure: direct lookup in a memory indexed by the incoming (port, VCI/label), yielding the outgoing (port, VCI/label).]
16
IPv4 Addresses

• 32-bit addresses
• Dotted quad notation: e.g. 12.33.32.1
• Can be represented as integers on the IP number line [0, 2^32 - 1]: a.b.c.d denotes the integer a*2^24 + b*2^16 + c*2^8 + d

[Figure: the IP number line from 0.0.0.0 to 255.255.255.255.]
17
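The dotted-quad-to-integer mapping above can be sketched in a few lines (a minimal illustration; the helper name is ours):

```python
def ip_to_int(addr):
    """Map dotted quad a.b.c.d to the integer a*2^24 + b*2^16 + c*2^8 + d."""
    a, b, c, d = (int(x) for x in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

print(ip_to_int("12.33.32.1"))  # → 203497473
```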
Class-based Addressing

[Figure: the IP number line partitioned into classes: A starting at 0.0.0.0, B at 128.0.0.0, C at 192.0.0.0, then D and E.]

Class          | MS bits | Range                       | netid     | hostid
A              | 0       | 0.0.0.0 - 127.255.255.255   | bits 1-7  | bits 8-31
B              | 10      | 128.0.0.0 - 191.255.255.255 | bits 2-15 | bits 16-31
C              | 110     | 192.0.0.0 - 223.255.255.255 | bits 3-23 | bits 24-31
D (multicast)  | 1110    | 224.0.0.0 - 239.255.255.255 | -         | -
E (reserved)   | 11110   | 240.0.0.0 - 255.255.255.255 | -         | -
18
Lookups with Class-based Addresses

[Figure: exact match on the netid. A table maps netid to port: 23 → Port 1, 186.21 → Port 2, 192.33.32 → Port 3. The class C address 192.33.32.1 matches netid 192.33.32 exactly and goes out Port 3.]
19
Problems with Class-based
Addressing
• Fixed netid-hostid boundaries too
inflexible: rapid depletion of address
space
• Exponential growth in size of routing
tables
20
Exponential Growth in Routing Table Sizes

[Figure: number of BGP routes advertised, growing exponentially over time.]
21
Classless Addressing (and
CIDR)
• Eliminated class boundaries
• Introduced the notion of a variable length
prefix between 0 and 32 bits long
• Prefixes represented by P/l: e.g., 122/8,
212.128/13, 34.43.32/22, 10.32.32.2/32
etc.
• An l-bit prefix represents an aggregation of 2^(32-l) IP addresses
22
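As a small illustration of prefix aggregation (the helper name is ours), the block of addresses covered by an l-bit prefix can be computed directly:

```python
def prefix_bounds(prefix):
    """Return (lowest, highest) integer address covered by 'a.b[.c[.d]]/l'.

    An l-bit prefix covers 2^(32-l) addresses.
    """
    net, l = prefix.split("/")
    l = int(l)
    parts = (net.split(".") + ["0"] * 4)[:4]      # pad missing octets with 0
    base = 0
    for p in parts:
        base = (base << 8) | int(p)
    mask = (0xFFFFFFFF << (32 - l)) & 0xFFFFFFFF  # top l bits set
    low = base & mask
    high = low + (1 << (32 - l)) - 1
    return low, high

lo, hi = prefix_bounds("212.128/13")
print(hi - lo + 1)  # → 524288, i.e. 2^19 addresses
```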
CIDR: Hierarchical Route Aggregation

[Figure: sites S (192.2.1/24) and T (192.2.2/24) attach via routers R3 and R4 to ISP P (192.2.0/22); P and ISP Q (200.11.0/22) attach to backbone routers R1 and R2. The backbone routing table holds only the aggregate 192.2.0/22 via R2; on the IP number line, 192.2.1/24 and 192.2.2/24 nest inside 192.2.0/22.]
23
Size of the Routing Table

[Figure: number of active BGP prefixes vs. date. Source: http://www.telstra.net/ops/bgptable.html]
24
Classless Addressing

[Figure: under class-based addressing, the number line from 0.0.0.0 to 255.255.255.255 splits into fixed regions A, B, C; under classless addressing, variable-length prefixes such as 23/8, 191/8, 191.23/16, 191.23.14/23 and 191.128.192/18 can nest arbitrarily on the line.]
25
Non-aggregatable Prefixes: (1) Multi-homed Networks

[Figure: a network with prefix 192.2.2/24 is multi-homed, reachable through ISP P (192.2.0/22) via R3 and through another provider via R4. The backbone routing table must carry both the aggregate 192.2.0/22 via R2 and the more-specific 192.2.2/24 via R3.]
26
Non-aggregatable Prefixes: (2) Change of Provider

[Figure: site T (192.2.2/24) moves from ISP P (192.2.0/22) to ISP Q (200.11.0/22) but keeps its prefix. The backbone routing table must now carry 192.2.2/24 via R3 alongside the aggregate 192.2.0/22 via R2; on the IP number line, 192.2.2/24 no longer aggregates with its provider's block.]
27
Routing Lookups with CIDR

[Figure: prefixes 192.2.0/22 (via R2), 192.2.2/24 (via R3) and 200.11.0/22 (via R4) on the number line. Address 192.2.0.1 matches only 192.2.0/22; 192.2.2.100 matches both 192.2.0/22 and the more specific 192.2.2/24; 200.11.0.33 matches 200.11.0/22.]

Find the most specific route, i.e. the longest matching prefix among all the prefixes matching the destination address of an incoming packet.
28
Longest Prefix Match is
Harder than Exact Match
• The destination address of an
arriving packet does not carry with it
the information to determine the
length of the longest matching prefix
• Hence, one needs to search among
the space of all prefix lengths; as well
as the space of all prefixes of a given
length
29
Metrics for Lookup Algorithms

• Speed
• Storage requirements
• Low update time
• Ability to handle large routing tables
• Flexibility in implementation
• Low preprocessing time
30
Maximum Bandwidth per Installed Fiber

[Figure: single-fiber capacity (Gb/s, log scale from 0.01 to 100000) vs. year (1980-2005), doubling every year (2x per year). Source: Lucent]
31
Maximum Bandwidth per Router Port, and Lookup Performance Required

Year    | Line  | Line-rate (Gbps) | 40B (Mpps) | 84B (Mpps) | 354B (Mpps)
1997-98 | OC3   | 0.155            | 0.48       | 0.23       | 0.054
1998-99 | OC12  | 0.622            | 1.94       | 0.92       | 0.22
1999-00 | OC48  | 2.5              | 7.81       | 3.72       | 0.88
2000-01 | OC192 | 10.0             | 31.25      | 14.88      | 3.53
2002-03 | OC768 | 40.0             | 125        | 59.52      | 14.12
        | 1GE   | 1.0              | 3.13       | 1.49       | 0.35
32
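The Mpps figures in the table follow directly from the line rate divided by the packet size in bits; a one-line sketch (the function name is ours):

```python
def required_mpps(line_rate_gbps, pkt_bytes):
    """Worst-case lookup rate (Mpps) for back-to-back packets of a given size."""
    return line_rate_gbps * 1e9 / (8 * pkt_bytes) / 1e6

print(required_mpps(10.0, 40))  # OC192 with 40-byte packets → 31.25
```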
Size of Routing Table?

• Currently, 85K entries
• At 25K new prefixes per year, need 230-256K prefixes over the next 5 years
• Decreasing costs of transmission may increase the rate of routing table growth
• At 50K new prefixes per year, need 350-400K prefixes over the next 5 years
33
Routing Update Rate?
• Currently a peak of a few hundred BGP
updates per second
• Hence, 1K per second is a must
• 5-10K updates/second seems to be safe
• BGP limitations may be a bottleneck first
• Updates should be atomic, and should
interfere little with normal lookups
34
Routing Lookups: Outline
• Background and problem
definition
• Lookup schemes
• Comparative evaluation
35
Example Forwarding Table (5-bit Prefixes)

Prefix       | Next-hop
P1 = 111*    | H1
P2 = 10*     | H2
P3 = 1010*   | H3
P4 = 10101   | H4
36
Linear Search
• Keep prefixes in a linked list
• O(N) storage, O(N) lookup time, O(1)
update complexity
• Improve average time by keeping
linked list sorted in order of prefix
lengths
37
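The linear search above, applied to the example 5-bit table, can be sketched directly (O(N) per lookup, as the slide notes; names are ours):

```python
# The example forwarding table: (prefix bits, next hop)
TABLE = [("111", "H1"), ("10", "H2"), ("1010", "H3"), ("10101", "H4")]

def lookup(addr_bits):
    """Scan every entry, remembering the longest matching prefix."""
    best, best_len = None, -1
    for prefix, nexthop in TABLE:
        if addr_bits.startswith(prefix) and len(prefix) > best_len:
            best, best_len = nexthop, len(prefix)
    return best

print(lookup("10111"))  # only P2 = 10* matches → H2
```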
Caching Addresses

[Figure: a route-cache architecture. A CPU with buffer memory handles the slow path; line cards with local buffer memory and MACs forward cached destinations on the fast path via DMA.]
38
Caching Addresses

Advantages: increased average lookup performance.
Disadvantages: decreased locality in backbone traffic; cache size; cache management overhead; hardware implementation difficult.
39
Radix Trie

Prefixes: P1 = 111* (H1), P2 = 10* (H2), P3 = 1010* (H3), P4 = 10101 (H4)

[Figure: a binary trie in which each node stores a next-hop pointer (if it is a prefix), a left pointer (0) and a right pointer (1). The lookup of 10111 follows bits 1, 0, 1, 1, 1, remembering the last prefix seen (P2) as the longest match so far. Adding P5 = 1110* inserts one new node on the 111 path.]
40
Radix Trie

• W-bit prefixes: O(W) lookup, O(NW) storage and O(W) update complexity

Advantages: simplicity; extensible to wider fields.
Disadvantages: worst-case lookup slow; wastage of storage space in chains.
41
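A minimal sketch of the binary trie just described (1-bit stride, remembering the longest match seen on the path; class and function names are ours):

```python
class Node:
    """One trie node: two child pointers plus a next hop if it is a prefix."""
    __slots__ = ("child", "nexthop")
    def __init__(self):
        self.child = [None, None]   # left (bit 0) / right (bit 1)
        self.nexthop = None

def insert(root, prefix, nexthop):
    node = root
    for bit in prefix:
        i = int(bit)
        if node.child[i] is None:
            node.child[i] = Node()
        node = node.child[i]
    node.nexthop = nexthop

def lookup(root, addr_bits):
    node, best = root, None
    for bit in addr_bits:
        node = node.child[int(bit)]
        if node is None:
            break
        if node.nexthop is not None:
            best = node.nexthop     # longest match so far
    return best

root = Node()
for p, h in [("111", "H1"), ("10", "H2"), ("1010", "H3"), ("10101", "H4")]:
    insert(root, p, h)
print(lookup(root, "10111"))  # → H2
```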
Leaf-pushed Binary Trie

Prefixes: P1 = 111* (H1), P2 = 10* (H2), P3 = 1010* (H3), P4 = 10101 (H4)

[Figure: a binary trie in which prefixes are pushed down to the leaves, so each node field is either a child pointer or a next hop but never both; internal prefixes such as P2 are replicated at the leaves beneath them.]
42
PATRICIA

Prefixes: P1 = 111* (H1), P2 = 10* (H2), P3 = 1010* (H3), P4 = 10101 (H4)

[Figure: a Patricia tree whose internal nodes store a bit-position to test plus left and right pointers; one-way branches are removed, so the lookup of 10111 tests only the bit positions stored on its path (1, 2, 3, 5 in the figure) and must then verify the stored prefix, backtracking on a mismatch.]
43
PATRICIA

• W-bit prefixes: O(W^2) lookup, O(N) storage and O(W) update complexity

Advantages: decreased storage; extensible to wider fields.
Disadvantages: worst-case lookup slow; backtracking makes implementation complex.
44
Path-compressed Tree

Prefixes: P1 = 111* (H1), P2 = 10* (H2), P3 = 1010* (H3), P4 = 10101 (H4)

[Figure: each node stores (bitstring, next-hop if a prefix ends there, bit-position), e.g. (1, -, 2), (10, P2, 4), (1010, P3, 5); one-way branches are compressed, and the lookup of 10111 compares the stored bitstrings along its path.]
45
Path-compressed Tree

• W-bit prefixes: O(W) lookup, O(N) storage and O(W) update complexity

Advantages: decreased storage.
Disadvantages: worst-case lookup slow.
46
Early Lookup Schemes

• BSD Unix [sklower91]: Patricia, expected lookup time = 1.44·logN
• Dynamic prefix trie [doeringer96]: Patricia variant with complex insertion/deletion; 40K entries consumed 2MB at 0.3-0.5 Mpps
47
Multi-bit Tries

[Figure: a binary trie has depth W, degree 2 and stride 1 bit; a multi-ary trie with stride k bits has depth W/k and degree 2^k.]
48
Prefix Expansion with Multi-bit Tries

If stride = k bits, prefix lengths that are not a multiple of k need to be expanded. E.g., k = 2:

Prefix | Expanded prefixes
0*     | 00*, 01*
11*    | 11*

Maximum number of expanded prefixes corresponding to one non-expanded prefix = 2^(k-1)
49
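Expansion to the next multiple of the stride can be sketched as follows (an l-bit prefix grows to 2^pad prefixes, where pad pads l up to a multiple of k; the function name is ours):

```python
def expand(prefix, k):
    """Expand a bit-string prefix so its length is a multiple of stride k."""
    pad = (k - len(prefix) % k) % k
    if pad == 0:
        return [prefix]
    # Append every possible pad-bit suffix.
    return [prefix + format(i, "0{}b".format(pad)) for i in range(2 ** pad)]

print(expand("0", 2))   # → ['00', '01']
print(expand("11", 2))  # → ['11']
```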
Four-ary Trie (k=2)

Prefixes: P1 = 111* (H1), P2 = 10* (H2), P3 = 1010* (H3), P4 = 10101 (H4)

[Figure: a four-ary trie node holds a next-hop pointer (if prefix) and four child pointers ptr00, ptr01, ptr10, ptr11. Expansion replicates prefixes: P1 becomes P11/P12 and P4 becomes P41/P42. The lookup of 10111 consumes two bits per step.]
50
Compressed Trie (k=8)

[Figure: an 8-8-8-8 split with levels L8, L16, L24, L32: only 4 memory accesses per lookup!]
51
Prefix Expansion Increases
Storage Consumption
• Replication of next-hop ptr
• Greater number of unused (null)
pointers in a node
Time ~ W/k
Storage ~ (NW/k) * 2^(k-1)
52
Generalization: Different Strides at Each Trie Level

• 16-8-8 split
• 4-10-10-8 split
• 24-8 split
• 21-3-8 split
53
Choice of Strides: Controlled Prefix Expansion [Sri98]

Given a forwarding table and a desired number of memory accesses in the worst case (i.e., maximum tree depth D), a dynamic programming algorithm computes the optimal sequence of strides that minimizes the storage requirement: runs in O(W^2 * D) time.

Advantages: optimal storage under these constraints.
Disadvantages: updates lead to sub-optimality anyway; hardware implementation difficult.
54
Further Generalization: Different Stride at Each Node [Sri98]

Given a forwarding table and a desired number of memory accesses in the worst case (i.e., maximum tree depth D), a dynamic programming algorithm computes the optimal stride at each node that minimizes the storage requirement: runs in O(N * W^2 * D) time.
55
Stride Optimization: Implementation Results

               | Two levels     | Three levels
Fixed-stride   | 49 MB, 1 ms    | 1.8 MB, 1 ms
Varying-stride | 1.6 MB, 130 ms | 0.57 MB, 871 ms

38816 prefixes, 300 MHz P-II
56
Lulea Algorithm [lulea98]

[Figure: a 16-8-8 split trie with levels L16, L24, L32.]
57
Lulea Algorithm

[Figure: the leaf-pushed trie level is encoded as a bit vector marking the positions where the next-hop or pointer value changes; the 16-8-8 split keeps each level compact.]
58
Lulea Algorithm

[Figure: the bit vector is divided into chunks. A codeword array (10001010, 11100010, 10000010, 10110100, 11000000) stores each chunk's bitmap and an offset (R1,0; R2,3; R3,7; R4,9; R5,0); a base index array (0, 13, ...) is consulted every four codewords. Counting set bits yields the index into the pointer array of next hops (P1-P4).]
59
Lulea Algorithm

33K entries: 160KB, average 2 Mpps

Advantages: extremely small data structure that can fit in L1/L2 cache.
Disadvantages: scalability to larger tables unclear; incremental updates not supported.
60
Binary Search on Trie Levels [wald98]

[Figure: a prefix P in a trie, with markers placed at intermediate levels to guide a binary search over prefix lengths.]
61
Binary Search on Trie Levels

Example prefixes: 10/8, 10.1/16, 10.1.10/22, 10.1.32/22, 10.2.64/22

Prefix-length | Hash table contents
8             | 10
16            | 10.1, 10.2 (marker)
22            | 10.1.10, 10.1.32, 10.2.64

[Figure: binary search over the prefix lengths. For address 10.1.10.4, the search probes length 16 (hit on 10.1), then 22 (hit on 10.1.10). For 10.2.3.9, it probes 16 (hit on the marker 10.2), fails at 22, and falls back to the best match precomputed with the marker (10/8).]
62
Binary Search on Trie Levels

33K entries: 1.4 MB, 1.2-2.2 Mpps

Advantages: scales nicely to IPv6.
Disadvantages: multiple hashed memory accesses; incremental updates complex.
63
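The binary search with markers can be sketched for the example above (the table contents, including the marker's precomputed best match, are hand-built here; in the real scheme they come from preprocessing):

```python
LENGTHS = [8, 16, 22]
# Per-length hash tables keyed by the first n address bits.
# Entries are ("prefix", p) for real prefixes, ("marker", bmp) for markers,
# where bmp is the marker's precomputed best-matching prefix.
TABLES = {
    8:  {"00001010": ("prefix", "10/8")},
    16: {"0000101000000001": ("prefix", "10.1/16"),
         "0000101000000010": ("marker", "10/8")},     # marker for 10.2.64/22
    22: {"0000101000000001000010": ("prefix", "10.1.10/22"),
         "0000101000000001001000": ("prefix", "10.1.32/22"),
         "0000101000000010010000": ("prefix", "10.2.64/22")},
}

def bits(addr, n):
    a, b, c, d = (int(x) for x in addr.split("."))
    return format((a << 24) | (b << 16) | (c << 8) | d, "032b")[:n]

def lookup(addr):
    best, lo, hi = None, 0, len(LENGTHS) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry = TABLES[LENGTHS[mid]].get(bits(addr, LENGTHS[mid]))
        if entry is None:
            hi = mid - 1      # miss: only shorter lengths can match
        else:
            best = entry[1]   # prefix itself, or the marker's best match
            lo = mid + 1      # hit or marker: try longer lengths
    return best

print(lookup("10.1.10.4"))  # → 10.1.10/22
```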
Binary Search on Prefix Intervals [lampson98]

Prefix | Bits | Interval
P1     | *    | 0000-1111
P2     | 00*  | 0000-0011
P3     | 1*   | 1000-1111
P4     | 1101 | 1101-1101
P5     | 001* | 0010-0011

[Figure: the prefixes partition the number line 0000-1111 into disjoint intervals I1 = 0000-0001 (P2), I2 = 0010-0011 (P5), I3 = 0100-0111 (P1), I4 = 1000-1100 (P3), I5 = 1101 (P4), I6 = 1110-1111 (P3); a binary search over the interval endpoints finds the matching interval.]
64
Alphabetic Tree

[Figure: the interval endpoints arranged as a binary search tree: each internal node compares the address against an endpoint (≤ / >), and each leaf is one of the intervals I1-I6 with its matching prefix (P2, P5, P1, P3, P4, P3).]
65
Multiway Search on Intervals

38K entries: 0.95 MB, 2.1 Mpps

Advantages: space is O(N).
Disadvantages: incremental updates complex.
66
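Search on the precomputed intervals of the 4-bit example reduces to one binary search over the interval start points (a sketch with hand-derived tables; names are ours):

```python
import bisect

# Disjoint-interval start points and the best-matching prefix for each,
# derived from P1..P5 of the example.
STARTS = [0b0000, 0b0010, 0b0100, 0b1000, 0b1101, 0b1110]
MATCH  = ["P2",   "P5",   "P1",   "P3",   "P4",   "P3"]

def lookup(addr):
    """Find the interval containing addr, hence the longest matching prefix."""
    i = bisect.bisect_right(STARTS, addr) - 1
    return MATCH[i]

print(lookup(0b0011))  # inside 0010-0011 → P5
print(lookup(0b1100))  # inside 1000-1100 → P3
```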
Depth-constrained Near-optimal Alphabetic Tree

• Redraw the binary search tree based on the probability of access of routing table entries:
– Minimize average lookup time
– But keep worst-case lookup time bounded

40% improvement in lookup time with a small relaxation in worst-case lookup time.
67
Routing Lookups in Hardware [gupta98]

[Figure: histogram of number of prefixes vs. prefix length for the MAE-EAST routing table, April 11, 2000 (source: www.merit.edu); the vast majority of prefixes are 24 bits or shorter.]
68
Routing Lookups in Hardware

[Figure: prefixes up to 24 bits are expanded into a directly indexed table of 2^24 = 16M entries; the first 24 bits of the destination address (e.g. 142.19.6 of 142.19.6.14) index the table, whose entry holds the next hop.]
69
Routing Lookups in Hardware

[Figure: each 24-bit-indexed entry carries a flag bit: 0 means the entry holds the next hop directly; 1 means it holds a base pointer into a second table for prefixes longer than 24 bits, indexed by base plus the last 8 bits of the address (e.g. the 14 in 128.3.72.14).]
70
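The two-table lookup can be simulated in software (a sketch: dicts stand in for the 16M-entry memory, the flag bit is modelled as a tuple tag, and the second-level block is per-address rather than a full 256-entry block; all names are ours):

```python
TBL24 = {}    # top 24 bits -> ("hop", nexthop) or ("ptr", base)
TBLLONG = {}  # (base, last 8 bits) -> nexthop

def add_short(top24, nexthop):
    TBL24[top24] = ("hop", nexthop)

def add_long(top24, base, low8, nexthop):
    TBL24[top24] = ("ptr", base)
    TBLLONG[(base, low8)] = nexthop

def lookup(addr):
    top24, low8 = addr >> 8, addr & 0xFF
    tag, val = TBL24[top24]
    return val if tag == "hop" else TBLLONG[(val, low8)]

add_short(0x8E1306, "A")        # 142.19.6/24 -> next hop A
add_long(0x800348, 0, 14, "B")  # 128.3.72.14/32 -> next hop B
print(lookup(0x8E13060A))       # 142.19.6.10 → A
print(lookup(0x8003480E))       # 128.3.72.14 → B
```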
Routing Lookups in Hardware

[Figure: generalization — a first table of 2^n entries for prefixes up to n bits, pointing into 2^m-entry second-level blocks for prefixes longer than n (up to n+m) bits, whose entries hold the next hop.]
71
Routing Lookups in Hardware

Various compression schemes can be employed to decrease the storage requirements: e.g. carefully chosen variable-length strides, bitmap compression etc.

Advantages: 20 Mpps with 50ns DRAM, or 66 Mpps with e-DRAM; easy to implement in hardware.
Disadvantages: large memory required (9-33 MB); depends on the prefix-length distribution.
72
Content-addressable Memory
(CAM)
• Fully associative memory
• Exact match operation in a single
clock cycle: parallel compare
73
Lookups with Ternary-CAM

[Figure: the destination address is compared in parallel against all M TCAM entries (P32, P31, ..., P8, stored in decreasing prefix-length order); the match lines feed a priority encoder whose output indexes the next-hop memory (RAM).]
74
Lookups with TCAM

Advantages: fast (15-20 ns).
Disadvantages: expensive and low density (0.25 MB at 50 MHz costs $30-$75); high power (5-8 W); updates slow.
75
Updates with TCAM

[Figure: the M entries sorted by prefix length (P32 at the top, then P31, ..., P8), with empty space kept in the middle of the array so that insertions move few entries.]

Issue: how to manage the free space [Hoti'00]
76
Routing Lookups: Outline
• Background and problem
definition
• Lookup schemes
• Comparative evaluation
77
Performance Comparison: Complexity

Algorithm                    | Lookup  | Storage | Update
Binary trie                  | W       | NW      | W
Patricia                     | W^2     | N       | W
Path-compressed trie         | W       | N       | W
Multi-ary trie               | W/k     | N*2^k   | -
LC trie                      | W       | N       | -
Lulea                        | -       | -       | -
Binary search on trie levels | logW    | NlogW   | -
Binary search on intervals   | log(2N) | N       | -
TCAM                         | 1       | N       | W
78
Performance Comparison

Algorithm                                      | Lookup (ns) | Storage (KB)
Patricia (BSD)                                 | 2500        | 3262
Multi-way fixed-stride optimal trie (3 levels) | 298         | 1930
Multi-way fixed-stride optimal trie (5 levels) | 428         | 660
LC trie                                        | -           | 700
Lulea                                          | 409         | 160
Binary search on trie levels                   | 650         | 1600
6-way search on intervals                      | 490         | 950
Lookups with direct access                     | 15-60       | 9,000-33,000
TCAM                                           | 15-20       | 512
79
Routing Lookups: References
• [lulea98] A. Brodnik, S. Carlsson, M. Degermark, S.
Pink. “Small Forwarding Tables for Fast Routing
Lookups”, Sigcomm 1997, pp 3-14.
• [gupta98] P. Gupta, S. Lin, N.McKeown. “Routing
lookups in hardware at memory access speeds”,
Infocom 1998, pp 1241-1248, vol. 3.
• P. Gupta, B. Prabhakar, S. Boyd. “Near-optimal
routing lookups with bounded worst case
performance,” Proc. Infocom, March 2000
• [lampson98] B. Lampson, V. Srinivasan, G. Varghese. "IP lookups using multiway and multicolumn search", Infocom 1998, pp 1248-56, vol. 3.
80
Routing lookups : References
(contd)
• [wald98] M. Waldvogel, G. Varghese, J. Turner, B.
Plattner. “Scalable high speed IP routing lookups”,
Sigcomm 1997, pp 25-36.
• [LC-trie] S. Nilsson, G. Karlsson. “Fast address
lookup for Internet routers”, IFIP Intl Conf on
Broadband Communications, Stuttgart, Germany,
April 1-3, 1998.
• [Sri98] V. Srinivasan, G. Varghese. "Fast IP lookups using controlled prefix expansion", Sigmetrics, June 1998
• TCAM vendors: netlogicmicro.com, laratech.com,
mosaid.com, sibercore.com
81
Packet Classification
82
Packet Classification: Outline
• Background and problem
definition
• Classification schemes
• Comparative evaluation
83
Flow-aware vs Flow-unaware Routers (recap)

• Flow-aware router: keeps track of flows and performs similar processing on packets in a flow
• Flow-unaware router (packet-by-packet router): treats each incoming packet individually
84
Why Flow-aware Router?

ISPs want to provide differentiated services.
⇒ Routers require additional mechanisms: admission control, resource reservation, per-flow queueing, fair scheduling etc.
⇒ Routers need classification: the capability to distinguish and isolate traffic belonging to different flows, based on negotiated service agreements (rules or policies).
85
Need for Differentiated Services

[Figure: ISPs (ISP1, ISP2, ISP3) and enterprise networks (E1, E2) interconnected at a NAP; X, Y and Z label interfaces of ISP1.]

Service         | Example
Traffic Shaping | Ensure that ISP3 does not inject more than 50Mbps of total traffic on interface X, of which no more than 10Mbps is email traffic
Packet Filtering| Deny all traffic from ISP2 (on interface X) destined to E2
Policy Routing  | Send all voice-over-IP traffic arriving from E1 (on interface Y) and destined to E2 via a separate ATM network
86
More Value-added Services

• Differentiated services
– Regard traffic from Autonomous System #33 as 'platinum-grade'
• Accounting and billing
– Treat all video traffic as highest priority and perform accounting for this type of traffic
• Committed Access Rate (rate limiting)
– Rate-limit WWW traffic from sub-interface #739 to 10Mbps
87
Multi-field Packet Classification

Rule   | Field 1    | Field 2      | … | Field k | Action
Rule 1 | 5.3.90/21  | 2.13.8.11/32 | … | UDP     | A1
Rule 2 | 5.168.3/24 | 152.133/16   | … | TCP     | A2
…      | …          | …            | … | …       | …
Rule N | 5.168/16   | 152/8        | … | ANY     | AN

Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet.

Example: packet (5.168.3.32, 152.133.171.71, …, TCP)
88
Packet Header Fields for Classification

[Figure: packet layout in the direction of transmission: MAC header (L2-DA, L2-SA), network-layer header (L3-PROT, L3-DA, L3-SA), transport-layer header (L4-PROT, L4-DP, L4-SP), then payload.]

DA = Destination Address, SA = Source Address, PROT = Protocol, SP = Source port, DP = Destination port
L2 = layer 2 (e.g., Ethernet), L3 = layer 3 (e.g., IP), L4 = layer 4 (e.g., TCP)
89
Flow-aware Router: Basic Architectural Components

Control: routing, resource reservation, admission control, SLAs
Datapath (per-packet processing): routing lookup, packet classification, special processing, switching, scheduling
90
Packet Classification

[Figure: the forwarding engine extracts the header of an incoming packet and matches it against the classifier (policy database), a table of predicates and actions; the result is the action to apply to the packet.]

91
Packet Classification: Problem Definition

Given a classifier C with N rules Rj, 1 ≤ j ≤ N, where Rj consists of three entities:
1) A regular expression Rj[i], 1 ≤ i ≤ d, on each of the d header fields,
2) A number pri(Rj) indicating the priority of the rule in the classifier, and
3) An action, referred to as action(Rj).

For an incoming packet P with the header considered as a d-tuple of points (P1, P2, …, Pd), the d-dimensional packet classification problem is to find the rule Rm with the highest priority among all the rules Rj matching the d-tuple, i.e., pri(Rm) > pri(Rj) for all j ≠ m, 1 ≤ j ≤ N, such that Pi matches Rj[i], 1 ≤ i ≤ d. We call rule Rm the best matching rule for packet P.
92
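The baseline that every scheme in this section tries to beat is a brute-force scan of the rules in priority order; a toy 3-field sketch (rules loosely modelled on the earlier example, with prefix matching simplified to string tests; all names are ours):

```python
def matches(preds, pkt):
    """A packet matches a rule iff every per-field predicate holds."""
    return all(pred(field) for pred, field in zip(preds, pkt))

# Rules stored in decreasing priority: per-field predicates + action.
RULES = [
    ([lambda sa: sa.startswith("5.168.3."),
      lambda da: da.startswith("152.133."),
      lambda pr: pr == "TCP"], "A2"),
    ([lambda sa: sa.startswith("5.168."),
      lambda da: da.startswith("152."),
      lambda pr: True], "AN"),
]

def classify(pkt):
    for preds, action in RULES:
        if matches(preds, pkt):
            return action   # first match = highest priority
    return "default"

print(classify(("5.168.3.32", "152.133.171.71", "TCP")))  # → A2
```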
Example 4D Classifier

Rule | L3-DA                          | L3-SA                           | L4-DP       | L4-PROT | Action
R1   | 152.163.190.69/255.255.255.255 | 152.163.80.11/255.255.255.255   | *           | *       | Deny
R2   | 152.168.3/255.255.255          | 152.163.200.157/255.255.255.255 | eq www      | udp     | Deny
R3   | 152.168.3/255.255.255          | 152.163.200.157/255.255.255.255 | range 20-21 | udp     | Permit
R4   | 152.168.3/255.255.255          | 152.163.200.157/255.255.255.255 | eq www      | tcp     | Deny
R5   | *                              | *                               | *           | *       | Deny
93
Example Classification Results

Pkt Hdr | L3-DA          | L3-SA           | L4-DP | L4-PROT | Rule, Action
P1      | 152.163.190.69 | 152.163.80.11   | www   | tcp     | R1, Deny
P2      | 152.168.3.21   | 152.163.200.157 | www   | udp     | R2, Deny
94
Classification is a Generalization of Lookup

• Classifier = routing table
• One dimension (the destination address)
• Rule = routing table entry
• Regular expression = prefix
• Action = (next-hop-address, port)
• Priority = prefix-length
95
Metrics for Classification Algorithms

• Speed
• Storage requirements
• Low update time
• Ability to handle large classifiers
• Flexibility in implementation
• Low preprocessing time
• Scalability in the number of header fields
• Flexibility in rule specification
96
Size of Classifier?

• Microflow recognition: 128K-1M flows in a metro/edge router
• Firewall applications: 8-16K rules
• Wildcarded filters: 16-128K rules
• Depends heavily on where your box will be deployed
97
Packet Classification: Outline
• Background and problem
definition
• Classification schemes
• Comparative evaluation
98
Example Classifier

Rule | Destination Address | Source Address
R1   | 0*                  | 10*
R2   | 0*                  | 01*
R3   | 0*                  | 1*
R4   | 00*                 | 1*
R5   | 00*                 | 11*
R6   | 10*                 | 1*
R7   | *                   | 00*
99
Set-pruning Tries [Tsuchiya, Sri98]

Rule | DA  | SA
R1   | 0*  | 10*
R2   | 0*  | 01*
R3   | 0*  | 1*
R4   | 00* | 1*
R5   | 00* | 11*
R6   | 10* | 1*
R7   | *   | 00*

[Figure: a trie on the destination dimension whose nodes each hang a full source-dimension trie containing every rule whose DA prefix matches that node; rules such as R7 are replicated in many source tries.]

O(N^2) memory
100
Grid-of-Tries [Sri98]

(same rule table as above)

[Figure: a destination trie with one source trie per DA node, without set-pruning replication; the lookup must backtrack into the source tries of shorter matching DA prefixes.]

O(NW) memory, O(W^2) lookup
101
Grid-of-Tries [Sri98]

[Figure: the same structure augmented with switch pointers that jump from a failure point in one source trie directly into the source trie of the next-shorter matching DA prefix, eliminating backtracking.]

O(NW) memory, O(2W) lookup
102
Grid-of-Tries

20K entries: 2MB, 9 memory accesses (with expansion)

Advantages: good solution for two dimensions.
Disadvantages: static solution; not easily extensible to more than two dimensions.
103
Geometric Interpretation in 2D

[Figure: rules R1-R7 drawn as rectangles in the plane of dimension #1 vs. dimension #2. A prefix pair such as (144.24/16, 64/24) is a rectangle; a pair with one fully specified field such as (128.16.46.23, *) is a line; a packet (P1, P2) is a point that may fall inside several overlapping rectangles.]
104
Bitmap-intersection [Lak98]

[Figure: in each dimension, the projections of the rules partition the axis into intervals, and each interval stores an N-bit bitmap of the rules overlapping it (e.g. 0111 in one dimension and 1100 in the other, over rules R1-R4). Classification looks up one interval per dimension and ANDs the bitmaps; the highest-priority set bit identifies the matching rule.]
105
Bitmap-intersection

512 rules: 1 Mpps with a single FPGA (33MHz) and five 1Mb SRAM chips

Advantages: good solution for multiple dimensions, for small classifiers; hardware-optimized.
Disadvantages: static solution; large memory bandwidth (scales linearly in N); large amount of memory (scales quadratically in N).
106
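Bitmap-intersection can be sketched in 2D with toy integer fields (rule set and helper names are ours; bit i of a bitmap stands for rule Ri, with R1 the highest priority):

```python
import bisect

# Rules R1..R4 as (x-range, y-range), highest priority first.
RULES = [((0, 7), (4, 15)), ((0, 3), (0, 15)), ((4, 15), (0, 7)), ((8, 15), (0, 3))]

def build(axis):
    """Per-dimension: sorted interval start points + one rule-bitmap each."""
    pts = sorted({r[axis][0] for r in RULES} |
                 {r[axis][1] + 1 for r in RULES} | {0})
    maps = []
    for start in pts:
        bm = 0
        for i, (lohi) in enumerate(RULES):
            lo, hi = lohi[axis]
            if lo <= start <= hi:
                bm |= 1 << i
        maps.append(bm)
    return pts, maps

DIMS = [build(0), build(1)]

def classify(x, y):
    bm = ~0
    for (pts, maps), v in zip(DIMS, (x, y)):
        bm &= maps[bisect.bisect_right(pts, v) - 1]  # AND one bitmap per dim
    if bm == 0:
        return None
    return "R%d" % ((bm & -bm).bit_length())  # lowest set bit = best rule

print(classify(2, 5))   # → R1
print(classify(10, 2))  # R3 and R4 both match; R3 wins on priority
```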
2D Classification [Lak98]

[Figure: rules drawn as rectangles whose x-sides are prefixes (grouped by length, e.g. length 3 and length 4) and whose y-sides are ranges; the query point P1 is tested against each prefix length in turn.]
107
2D Classification [Lak98]:
Preprocessing
• Store the prefixes in a trie
• With each prefix store the set of
intervals that form a rectangle with
that prefix as the other side
• Store the intervals by storing them
as a set of non-overlapping disjoint
intervals
108
2D Classification [Lak98]:
Lookup
• For each prefix length:
– Find the prefix matching the incoming
point and the set of non-overlapping
intervals associated with the prefix
– Search for the non-overlapping interval
that contains the point
• Repeat for all prefix lengths
109
2D Classification [Lak98]: Complexity

• Lookups: O(W·logN) with N two-dimensional rules
– O(W + logN) using fractional cascading
• Space: O(N)
• Static data structure
110
Crossproducting [Sri98]

[Figure: rules R1-R4 as rectangles; the projections of the rules chop each axis into intervals. Each cell of the resulting cross-product grid precomputes its best-matching rule, so a packet such as (8,4) or (1,3) needs only one 1-D lookup per dimension plus one table access.]
111
Crossproducting

Need: d 1-D lookups + 1 memory access; O(N^d) space.
50 rules: 1.5MB; need caching (on-demand crossproducting) for bigger classifiers.

Advantages: fast accesses; suitable for multiple fields.
Disadvantages: large amount of memory; needs caching for bigger classifiers (> 50 rules).
112
Space-time Tradeoff

Point location among N non-overlapping regions in d dimensions requires either:
• O(logN) time with O(N^d) space, or
• O(log^(d-1) N) time with O(N) space

We need help: exploit the structure in real-life classifiers.
113
Recursive Flow Classification
[Gupta99]
Observations:
• Difficult to achieve both high
classification rate and reasonable
storage in the worst case
• Real classifiers exhibit structure and
redundancy
• A practical scheme could exploit this
structure and redundancy
114
RFC: Classifier Dataset
• 793 classifiers from 101 ISP and
enterprise networks with a total of 41505
rules.
• 40 classifiers: more than 100 rules. Biggest
classifier had 1733 rules.
• Maximum of 4 fields per rule: source IP
address, destination IP address, protocol
and destination port number.
115
Structure of the Classifiers

[Figure: three overlapping rectangles R1, R2, R3 creating only 4 distinct regions.]
116
Structure of the Classifiers

[Figure: the same three rectangles overlapped differently create 7 distinct regions, including {R1, R2}, {R2, R3} and {R1, R2, R3}.]

Dataset: the 1733-rule classifier produced only 4316 distinct regions (the worst case is about 10^13!)
117
Recursive Flow Classification

[Figure: a one-step mapping from the 2^S = 2^128 possible packet headers down to 2^T = 2^12 actions, versus a multi-step (recursive) reduction 2^128 → 2^64 → 2^32 → 2^12.]
118
Chunking of a Packet

[Figure: the relevant header fields (source and destination L3 address, L4 protocol and flags, source and destination L4 port, type of service) are split into chunks #0 through #7 of the packet header.]
119
Packet Flow

[Figure: the 128-bit header is split into chunks (16, 16, 16, 16, 8, 8, 8, ... bits) that index the Phase 0 memories; the results of each phase are combined (reduction) and index the next phase's memories (64 bits into Phase 1, 32 into Phase 2, 16 into Phase 3), until the final phase yields the action (14-bit index, 8-bit action).]

120
Choice of Reduction Tree

[Figure: two reduction trees over the six chunks: one with P = 3 phases needing 10 memory accesses, and one with P = 4 phases needing 11 memory accesses.]
121
RFC: Storage Requirements

[Figure: storage vs. number of rules for the classifier dataset.]
122
RFC: Classification Time
• Pipelined hardware: 30 Mpps (worst
case OC192) using two 4Mb SRAMs
and two 64Mb SDRAMs at 125MHz.
• Software: (3 phases) 1 Mpps in the
worst case and 1.4-1.7 Mpps in the
average case. (average case OC48)
[performance measured using the Intel VTune simulator on a Windows NT platform]
123
RFC: Pros and Cons

Advantages: exploits the structure of real-life classifiers; suitable for multiple fields; supports non-contiguous masks; fast accesses.
Disadvantages: depends on the structure of classifiers; large pre-processing time; incremental updates slow; large worst-case storage requirements.
124
Hierarchical Intelligent
Cuttings (HiCuts) [Gupta99]
Observations:
• No single good solution for all cases
– But real classifiers have structure
• Perhaps an algorithm can exploit this
structure
– A heuristic hybrid scheme …
125
HiCuts: Basic Idea

[Figure: a decision tree repeatedly cuts the rule set {R1, R2, R3, ..., Rn} along chosen dimensions until each leaf holds a small subset, e.g. {R1, R3, R4}, {R1, R2, R5}, {R8, Rn}.]

Binth: BinThreshold = Maximum Subset Size = 3
126
Heuristics to Exploit Classifier Structure

• Picking a suitable dimension to cut across:
– Minimize the maximum number of rules falling into any one partition, OR
– Maximize the entropy of the distribution of rules across the partitions, OR
– Maximize the number of different specifications in one dimension
• Picking the suitable number of partitions (cuts) to be made:
– Affects the space consumed and the classification time; tuned by a parameter, spfac
127
HiCuts: Number of Memory Accesses

[Figure: memory accesses vs. number of rules (log scale), compared against crossproducting; Binth = 8, spfac = 4.]
128
HiCuts: Storage Requirements

[Figure: space in KBytes (log scale) vs. number of rules (log scale); Binth = 8, spfac = 4.]
129
Incremental Update Time

[Figure: update time in seconds (log scale) vs. number of rules (log scale); Binth = 8, spfac = 4, 333MHz P-II running Linux.]
130
HiCuts: Pros and Cons

Advantages: exploits the structure of real-life classifiers; adapts the data structure; suitable for multiple fields; supports incremental updates.
Disadvantages: depends on the structure of classifiers; large pre-processing time; large worst-case storage requirements.
131
Tuple Space Search [Suri99]

Decompose the classification problem into a number of exact-match problems, then use hashing.

Rule           | Tuple
R1 (01*, 111*) | [2,3]
R2 (11*, 010*) | [2,3]
R3 (1*, *)     | [1,0]

Use one hash table for each tuple; search all hash tables sequentially.
132
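The decomposition can be sketched for the three example rules (rules with the same pair of prefix lengths share one exact-match hash table; names are ours):

```python
RULES = [("R1", ("01", "111")), ("R2", ("11", "010")), ("R3", ("1", ""))]

# (len1, len2) -> { (bits1, bits2): rule name }
TUPLES = {}
for name, (p1, p2) in RULES:
    TUPLES.setdefault((len(p1), len(p2)), {})[(p1, p2)] = name

def classify(f1, f2):
    """Probe every tuple's hash table with the appropriately truncated fields."""
    hits = []
    for (l1, l2), table in TUPLES.items():
        name = table.get((f1[:l1], f2[:l2]))
        if name:
            hits.append(name)
    return hits

print(sorted(classify("1100", "0101")))  # → ['R2', 'R3']
```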
Improved TSS via Precomputation

• Extension of "binary search on trie levels"
• If [2,3,3] succeeds, no need to search e.g. [4,5,6]
• If [2,3,3] fails, no need to search e.g. [1,2,1]
• Search the tuple space intelligently (decision tree on the tuple space)
133
TSS: Pros and Cons

Advantages: suitable for multiple fields; supports incremental updates; fast classification and updates on average.
Disadvantages: large pre-processing time; multiple hashed-memory accesses.
134
Area-based Quad Tree (AQT) Crossing [Buddhikot99]

[Figure: the 2D rule space is recursively divided into quadrants 00, 01, 10, 11; each quad-tree node stores its crossing filter set (rules that completely span the node in one dimension, e.g. {R1, R2}, {R3, R4}, {R5}). Lookup of a point P1 performs two 1-D longest-prefix-match operations at every node on the path from the root to a leaf.]

O(N) space
O(W·logN) lookup time
O(W + logN) using fractional cascading
135
AQT: Efficient Updates

[Figure: prefixes are partitioned into groups (old vs. new grouping) and pre-computation is done per group instead of per interval.]

O(aW) search and O(aN^(1/a)) updates
136
2-D Classification Using FIS Trees [Feldmann00]

[Figure: rules R1-R5 as rectangles, with an l-level FIS tree built over the x-projections. A lookup of point P1 performs (l+1) 1-D lookups; space is O(l·n^(1+1/l)).]
137
FIS Tree: Experimental Study

Number of rules | Levels in FIS tree | Storage space | Number of memory accesses
4-60 K          | 2                  | < 5 MB        | < 15
~10^6           | 3                  | < 100 MB      | < 18

Rulesets constructed using netflow data from AT&T Worldnet. Experiments done using static 2-D FIS trees.
138
Ternary CAMs

Advantages: suitable for multiple fields; fast (16-20 ns, i.e. 50-66 Mpps); simple to understand.
Disadvantages: inflexible (range-to-prefix blowup); density (largest available in 2000 is 32K x 128, but chips can be cascaded); management software and on-chip logic of non-trivial complexity; power (5-8 W); incremental updates slow; DRAM-based CAMs offer higher density but soft errors are a problem; cost ($30-$160 for 1Mb).
139
Range-to-prefix Blowup

Maximum memory blowup = factor of (2W-2)^d

Rule | Range   | Maximal Prefixes
R1   | [3,11]  | 0011, 01**, 10**
R2   | [2,7]   | 001*, 01**
R3   | [4,11]  | 01**, 10**
R4   | [4,7]   | 01**
R5   | [1,14]  | 0001, 001*, 01**, 10**, 110*, 1110
140
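The range-to-prefix conversion behind the blowup can be sketched as a greedy peel of the largest aligned power-of-two block at each step (the function name is ours; '*' marks don't-care bits):

```python
def range_to_prefixes(lo, hi, w):
    """Cover [lo, hi] with a minimal set of prefixes over w-bit values."""
    out = []
    while lo <= hi:
        size = lo & -lo if lo else 1 << w   # largest block aligned at lo
        while size > hi - lo + 1:
            size //= 2                      # shrink until it fits the range
        dc = size.bit_length() - 1          # number of don't-care bits
        out.append(format(lo >> dc, "0{}b".format(w - dc)) + "*" * dc)
        lo += size
    return out

print(range_to_prefixes(1, 14, 4))
# the worst case for w = 4: 2W - 2 = 6 prefixes
```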
Packet Classification:
References
• [Lak98] T.V. Lakshman. D. Stiliadis. “High speed
policy based packet forwarding using efficient
multi-dimensional range matching”, Sigcomm 1998,
pp 191-202
• [Sri98] V. Srinivasan, S. Suri, G. Varghese and M.
Waldvogel. “Fast and scalable layer 4 switching”,
Sigcomm 1998, pp 203-214
• [Suri99] V. Srinivasan, G. Varghese, S. Suri. “Fast
packet classification using tuple space search”,
Sigcomm 1999, pp 135-146
• [Gupta99] P. Gupta, N. McKeown, “Packet
classification using hierarchical intelligent
cuttings,” Hot Interconnects VII, 1999
141
Packet Classification:
References (contd.)
• [Gupta99] P. Gupta, N. McKeown, “Packet
classification on multiple fields,” Sigcomm 1999, pp
147-160
• [Buddhikot99] M. M. Buddhikot, S. Suri, and M.
Waldvogel, “Space decomposition techniques for
fast layer-4 switching,” Protocols for High Speed
Networks, vol. 66, no. 6, pp 277-283, 1999
• [Feldmann00] A. Feldmann and S. Muthukrishnan,
“Tradeoffs for packet classification,” Infocom
2000
• T. Woo, “A modular approach to packet
classification: algorithms and results, “ Infocom
2000
142
Special Instances of Classification

• Multicast
– PIM-SM: longest-prefix matching on the source and group address; try (S,G), then (*,G), then (*,*,RP); check the incoming interface
– DVMRP: incoming-interface check followed by (S,G) lookup
• IPv6
– 128-bit destination address field
143
Implementation Choices
Given Design Requirements
Disclaimer: These are my opinions
144
Design Requirement LU1
Requirements:
2.5 Gbps, 100K routes
Choices:
a) 2-4 TCAMs
b) On-chip logic with one external SDRAM
chip (using multibit tries)
c) On-chip e-DRAM
145
Design Requirement LU2
Requirements:
10 Gbps, 256K routes
Choices:
a) 4-8 TCAMs
b) On-chip logic with 2-4 external SDRAM
chips (using multibit tries)
c) On-chip e-DRAM
146
Design Requirement PC1
Requirements:
10 Gbps classification up to L4; 16-64K comparatively static 128-bit entries
Choices:
a) 1-4 TCAMs
b) On-chip logic with 2 external SDRAM
and 2 SRAM chips (using RFC)
c) Off-chip SRAMs (using HiCuts)
147
Your Design Here
Requirements:
Choices:
148
Lookup/Classification Chip Vendors

• Switch-on
• Fastchip
• Agere
• Solidum
• Siliconaccess
• TCAM vendors: Netlogic, Lara, Sibercore, Mosaid, Klsi etc.
149
Summary

• Both problems are well studied by now, but increasing line rates and database sizes continue to present interesting opportunities
• We still need a high-speed (~OC192), dynamic, generic, multi-field classification algorithm for a large number of (up to a million) rules
150
Thanks!
I will appreciate direct
feedback at
[email protected]
151