Data plane algorithms in routers


Data Plane Algorithms in Network Processing Systems
Lec. 24/25: from prefix lookup to deep packet inspection
Slides adapted from Cristian Estan, University of Wisconsin-Madison
Contents from George Varghese's "Network Algorithmics" and ACM/IEEE papers
What is the data plane?

- The part of the router that handles the traffic
  - Data plane algorithms are applied to every packet
  - Successive packets are typically treated independently
  - Example: deciding on which link to send a packet
- Throughput, defined as the number of packets or bytes handled per second, is very important
  - "Line speed" means keeping up with the rate at which traffic can be transmitted over the wire or fiber
  - Example: a 10 Gbps router has 32 ns to handle a 40-byte packet
- Memory usage is limited by technology and cost
  - Can afford at most tens of megabits of fast on-chip memory
- The network processor is one of the executing engines of a router's data plane
A generic data plane problem

- The router has many directives, each composed of a guard and an associated action (all guards distinct)
- There is a simple procedure for testing how well a guard matches a packet
- For each packet, find the guard that matches "best" and take the associated action
- Example, routing table lookup:
  - Each guard is an IP prefix (between 0 and 32 bits)
  - Matching procedure: is the guard a prefix of the 32-bit destination IP address?
  - "Best" is defined as the longest matching prefix
The rules of the game

- Matching against all guards in sequence is too slow
- We build a data structure that captures the semantics of all guards and use it for matching
- Primary metrics:
  - How fast the matching algorithm is
  - How much memory the data structure needs
  - Time to build the data structure also has some importance
- We can cheat (but we won't today) by:
  - Using binary or ternary content-addressable memories (CAMs/TCAMs)
  - Using other forms of hardware support
Measuring "algorithm complexity"

- Execution cost is measured in the number of memory accesses needed to read the data structure
  - Actual data manipulation operations are typically very simple
  - On some platforms we can read wide words
- Worst-case performance matters most
  - Worst case is defined with respect to the input, not the guards
  - Caching has proven ineffective in many settings
  - Using algorithms with good amortized complexity but a bad worst case requires large buffers
Overview

- Longest matching prefix
  - Trie-based algorithms
    - Uni-bit and multi-bit tries (fixed stride and variable stride)
    - Leaf pushing
    - Bitmap compression of multi-bit trie nodes
    - Tree bitmap representation for multi-bit trie nodes
  - Binary search on ranges
  - Binary search on prefix lengths
- Classification on multiple fields
- Signature matching
Longest matching prefix

- Used in routing table lookup (a.k.a. forwarding) to find the link on which to send a packet
- Guard: a bit string of 0 to w bits called an IP prefix
- Action: a single-byte interface identifier
- Input: a w-bit string representing the destination IP address of the packet (w is 32 for IPv4, 128 for IPv6)
- Output: the interface associated with the longest guard matching the input
- Size of problem: hundreds of thousands of prefixes
Controlled prefix expansion with stride 3

Routing table:
  P1 0*
  P2 1*
  P3 100*
  P4 1000*
  P5 100000*
  P6 101*
  P7 110*
  P8 11001*
  P9 111*

[Figure: the uni-bit trie for this table, next to the multi-bit tries obtained by controlled prefix expansion with stride 3. Prefixes are expanded to lengths that are multiples of the stride: P1 0* becomes 000*-011*, P8 11001* becomes 110010* and 110011*, and P4 1000* becomes 100000*-100011* (with the 100000 slot won by the longer P5). The fixed-stride trie inspects 3 bits per node; the variable-stride trie may use different strides in different subtrees (here stride 3 under 100 and stride 2 under 110). Example lookup for input 11000010: the traversal passes P2 (1*) and ends at P7 (110*), the longest matching prefix.]

Leaf pushing reduces memory usage but increases update time.

Given a maximum trie height h and a routing table of size n, a dynamic programming algorithm computes the optimal variable-stride trie in O(n·w²·h).

DIMACS Tutorial on Algorithms for Next Generation Networks, August 6-8, 2007
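The uni-bit trie traversal above can be sketched in a few lines. This is a minimal illustration using the example routing table; names like TrieNode and longest_match are assumptions, not from the slides:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # '0'/'1' -> child TrieNode
        self.action = None   # next hop if a prefix ends at this node

def insert(root, prefix, action):
    node = root
    for bit in prefix:
        node = node.children.setdefault(bit, TrieNode())
    node.action = action

def longest_match(root, addr_bits):
    """Walk the trie one bit at a time, remembering the last prefix seen."""
    best, node = None, root
    for bit in addr_bits:
        if node.action is not None:
            best = node.action
        node = node.children.get(bit)
        if node is None:
            return best
    return node.action if node.action is not None else best

# Example routing table from the slide
root = TrieNode()
table = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4",
         "100000": "P5", "101": "P6", "110": "P7",
         "11001": "P8", "111": "P9"}
for p, a in table.items():
    insert(root, p, a)

print(longest_match(root, "11000010"))  # P7 (longest match is 110*)
```

The worst-case cost is one memory access per address bit, which is exactly what multi-bit tries reduce by consuming several bits per node.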
Lulea bitmap compression

Repeating entries are stored only once in the compressed array. An auxiliary bitmap is needed to find the right entry in the compressed node: it stores a 0 for positions that do not differ from the previous one. To retrieve entry i, count the 1s in the bitmap up to and including position i; that count, minus one, indexes the compressed array.

[Figure: the stride-3 node from the previous slide in compressed form. The auxiliary bitmap for the node shown is 1 0 0 0 1 1 1 1, so the eight logical entries are stored as a compressed array of five. Example lookup input: 11001010.]
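The compression and lookup steps can be sketched as follows. This is a minimal illustration with assumed names; the entry values stand in for next hops or child pointers, and a real implementation would pack the bitmap into machine words and use popcount:

```python
def compress_node(entries):
    """Store repeating entries once: the bitmap has a 1 wherever an
    entry differs from the previous one (first entry is always a 1)."""
    bitmap, compressed = [], []
    prev = object()  # sentinel, never equal to a real entry
    for e in entries:
        if e == prev:
            bitmap.append(0)
        else:
            bitmap.append(1)
            compressed.append(e)
            prev = e
    return bitmap, compressed

def lookup(bitmap, compressed, index):
    # Count the 1s up to and including `index`; that selects the entry.
    return compressed[sum(bitmap[:index + 1]) - 1]

# Illustrative leaf-pushed node whose repetition pattern matches the
# bitmap 1 0 0 0 1 1 1 1 shown on the slide
entries = ["P1", "P1", "P1", "P1", "P3", "P6", "P7", "P9"]
bitmap, compressed = compress_node(entries)
print(bitmap)      # [1, 0, 0, 0, 1, 1, 1, 1]
print(compressed)  # ['P1', 'P3', 'P6', 'P7', 'P9']
print(lookup(bitmap, compressed, 0b110))  # P7
```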
Representing node as tree bitmap

Pointers to children and prefixes are stored in separate structures. Prefixes of all lengths are stored, so leaf pushing is not needed and updates are fast. The bitmaps have 1s corresponding to entries that are not empty.

For the root node of the example trie:
- child bitmap, over the branches 000..111: 0 0 0 0 1 0 1 0 (children exist under 100 and 110)
- internal prefix bitmap, over the prefixes 0*, 1*, 00*, 01*, 10*, 11*, 000*, ..., 111*: 1 1 0 0 0 0 0 0 0 0 1 1 1 1 (the stored prefixes are P1, P2, P3, P6, P7, P9)

Bitmap supporting fast counting

When the compression bitmaps are large it is expensive to count bits during lookup. The bitmap is divided into chunks, and a precomputed auxiliary array stores the number of bits set before each chunk. The lookup algorithm then needs to count only the bits set within one chunk.

[Figure: a 32-bit bitmap divided into four 8-bit chunks, with auxiliary array 00→0, 01→4, 10→8, 11→13. A query in the last chunk adds the precomputed 13 to the bits counted within that chunk, e.g. 13+0=13.]
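The chunked counting can be sketched like this. Helper names are assumed; the 32-bit bitmap and the auxiliary array 0, 4, 8, 13 match the slide's example:

```python
CHUNK = 8

def build_aux(bitmap):
    """Precompute the number of 1s before each chunk."""
    aux, total = [], 0
    for i in range(0, len(bitmap), CHUNK):
        aux.append(total)
        total += sum(bitmap[i:i + CHUNK])
    return aux

def rank(bitmap, aux, pos):
    """Number of 1s strictly before position pos: one table lookup
    plus a count within a single chunk."""
    chunk = pos // CHUNK
    return aux[chunk] + sum(bitmap[chunk * CHUNK:pos])

# The 32-bit example bitmap from the slide, chunk by chunk
bits = [int(b) for b in "10001110" "10101001" "10101110" "00001110"]
aux = build_aux(bits)
print(aux)                  # [0, 4, 8, 13]
print(rank(bits, aux, 28))  # 13 + 0 = 13
```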
Binary search on ranges

- Divide the w-bit address space into maximal contiguous ranges covered by the same prefix
- Build an array or balanced (binary) search tree with the boundaries of the ranges
- At lookup time, perform an O(log n) search
- Not better than multi-bit tries with compression, but it is not covered by patents
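A sketch of the range-based lookup, reusing the example prefixes over a hypothetical 8-bit address space. Python's bisect stands in for the balanced search tree, and the brute-force lpm helper is used only to build the ranges:

```python
import bisect

prefixes = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4",
            "100000": "P5", "101": "P6", "110": "P7",
            "11001": "P8", "111": "P9"}
W = 8  # illustrative address width

def lpm(addr_bits):
    """Brute-force longest matching prefix (build-time reference only)."""
    best = max((p for p in prefixes if addr_bits.startswith(p)),
               key=len, default=None)
    return prefixes.get(best)

# Divide the address space into maximal ranges with the same answer
starts, answers = [], []
for a in range(2 ** W):
    ans = lpm(format(a, f"0{W}b"))
    if not answers or answers[-1] != ans:
        starts.append(a)
        answers.append(ans)

def range_lookup(addr):
    """O(log n) binary search over the range boundaries."""
    return answers[bisect.bisect_right(starts, addr) - 1]

print(range_lookup(0b11000010))  # P7, same answer as the trie lookup
```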
Binary search on prefix lengths

- Core idea: for each prefix length represented in the routing table, keep a hash table with the prefixes of that length
  - The longest matching prefix can be found by looking up, in each hash table, the prefix of the address with the corresponding length
  - Binary search over the prefix lengths is faster
- Simple but wrong algorithm: if you find a prefix at length x, store it as the best match and look for longer matching prefixes; otherwise look for shorter prefixes
  - Problem: what if there is both a shorter and a longer matching prefix, but no prefix at length x?
- Solution: insert a marker at length x when there are longer prefixes. The marker must store the longest matching shorter prefix. Markers lead to a moderate increase in memory usage.
- Promising algorithm for IPv6 (w=128)
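The marker scheme can be sketched as follows. Names are assumptions; the brute-force best_shorter helper is used only at build time to precompute, for each entry or marker, its best matching prefix:

```python
prefixes = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4",
            "100000": "P5", "101": "P6", "110": "P7",
            "11001": "P8", "111": "P9"}
lengths = sorted({len(p) for p in prefixes})
tables = {l: {} for l in lengths}

def best_shorter(bits):
    """Build-time helper: best matching real prefix of `bits`."""
    cands = [p for p in prefixes if bits.startswith(p)]
    return prefixes[max(cands, key=len)] if cands else None

for p in prefixes:
    for l in lengths:
        if l > len(p):
            break
        tables[l][p[:l]] = best_shorter(p[:l])  # real entry or marker

def lookup(addr_bits):
    """Binary search over the distinct prefix lengths."""
    best, lo, hi = None, 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        l = lengths[mid]
        if addr_bits[:l] in tables[l]:
            best = tables[l][addr_bits[:l]] or best
            lo = mid + 1      # hit (possibly only a marker): go longer
        else:
            hi = mid - 1      # miss: only shorter lengths can match
    return best

print(lookup("11000010"))  # P7
```

Each hit records the entry's precomputed best match, so even when a marker leads the search toward longer lengths that fail, the correct shorter answer survives.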
Papers on longest matching prefix

- G. Varghese, "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices", chapter 11, Morgan Kaufmann, 2005
- V. Srinivasan, G. Varghese, "Faster IP lookups using controlled prefix expansion", ACM Trans. on Comp. Sys., Feb. 1999
- M. Degermark, A. Brodnik, S. Carlsson, S. Pink, "Small forwarding tables for fast routing lookups", ACM SIGCOMM, 1997
- W. Eatherton, Z. Dittia, G. Varghese, "Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates", http://www-cse.ucsd.edu/~varghese/PAPERS/willpaper.pdf
- B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search", IEEE Infocom, 1998
- M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high-speed IP lookups", ACM Trans. on Comp. Sys., Nov. 2001
Overview

- Longest matching prefix
- Classification on multiple fields
  - Solution for the two-dimensional case: grid of tries
  - Bit vector linear search
  - Cross-producting
  - Decision tree approaches
- Signature matching
Packet classification problem

- Required for security and for recognizing packets with quality-of-service requirements
- Guard: prefixes or ranges for k header fields
  - Typically source and destination prefix, source and destination port range, and an exact value or * for the protocol
  - All fields must match for a rule to apply
- Action: drop, forward, map to a certain traffic class
- Input: a tuple with the values of the k header fields
- Output: the action associated with the first rule that matches the packet (rules are strictly ordered)
- Size of problem: thousands of classification rules
Example of classification rule set

[Figure: a router that filters traffic sits between the Internet and the internal network Net. The hosts involved are the mail gateway M, the internal time server TI, the external time server TO, and the secondary name server S.]

Dest IP  Src IP  Dest Port  Src Port  Protocol  Action
M        *       25         *         *         R1
M        *       53         *         UDP       R2
M        S       53         *         *         R3
M        *       23         *         *         R4
TI       TO      123        123       UDP       R5
*        Net     *          *         *         R6
Net      *       *          *         TCP/ACK   R7
*        *       *          *         *         R8
A geometric view of packet classification

[Figure: rules R1, R2, R3 drawn as rectangles in the two-dimensional space of source and destination addresses.]

- In theory, the number of regions defined can be much larger than the number of rules
- Any algorithm that guarantees O(n) space for all rule sets of size n needs O((log n)^(k-1)) time for classification
The two-dimensional case: source and destination IP addresses

- For each destination prefix in the rule set, link to the corresponding node in the destination IP trie a trie with the source prefixes of the rules using this destination prefix
- The matching algorithm must use backtracking to visit all source tries
- Grid of tries: by pre-computing "switch pointers" and propagating some information about more general rules, matching may proceed without backtracking
  - Memory used is proportional to the number of rules
  - Matching time is O(w), with the constant depending on the stride
  - The extended grid of tries handles 5 fields and has good run time and memory use in practice
Bit vector linear search

- Bit vector approaches do a linear search through the rule set
- For each field we pre-compute a structure (e.g. a trie) to find the most specific prefix or range distinguished by the rule set
- For each rule, a single bit represents whether a given most specific prefix/range matches the rule or not: with each range we associate a bitmap of size n encoding which of the rules may match a packet in that range
- The classification algorithm first computes, for each field of the packet, the most specific prefix/range it belongs to
- By then AND-ing together the k bitmaps of size n we find the matching rules
- Works well for hardware solutions that allow wide memory reads
- Scales poorly to large rule sets

Precomputed bitmaps for the example rule set (bit i set means rule Ri may match):

Dest IP:   M 11110111, TI 00001111, Net 00000111, * 00000101
Src IP:    S 11110011, TO 11011011, Net 11010111, * 11010011
Dest Port: 25 10000111, 53 01100111, 23 00010111, 123 00001111, * 00000111
Src Port:  123 11111111, * 11110111
Proto:     UDP 11111101, TCP 10110111, * 10110101

Example: 00000101 AND 11010011 AND 00000111 AND 11110111 AND 10110111 = 00000001, so the first (and only) matching rule is R8.
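The per-packet AND step can be sketched directly with the slide's bitmaps. Python integers stand in for the wide bit vectors read from memory, and the classify helper is an assumed name; bit i, counting from the left, corresponds to rule Ri:

```python
# Per-field bitmaps from the slide (leftmost bit = rule R1)
dst_ip   = {"M": "11110111", "TI": "00001111", "Net": "00000111", "*": "00000101"}
src_ip   = {"S": "11110011", "TO": "11011011", "Net": "11010111", "*": "11010011"}
dst_port = {"25": "10000111", "53": "01100111", "23": "00010111",
            "123": "00001111", "*": "00000111"}
src_port = {"123": "11111111", "*": "11110111"}
proto    = {"UDP": "11111101", "TCP": "10110111", "*": "10110101"}

def classify(d, s, dp, sp, pr):
    """AND the k per-field rule bitmaps; the leftmost set bit is the
    first (highest priority) matching rule."""
    v = (int(dst_ip[d], 2) & int(src_ip[s], 2) & int(dst_port[dp], 2)
         & int(src_port[sp], 2) & int(proto[pr], 2))
    if v == 0:
        return None
    n = len(dst_ip[d])                       # number of rules
    return "R%d" % (n - v.bit_length() + 1)  # index of leftmost 1

# Packet whose field classes are (*, *, *, *, TCP)
print(classify("*", "*", "*", "*", "TCP"))  # R8
```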
Cross-producting

Cross-producting performs longest prefix matching separately for all fields and combines the results in a single step by looking up the matching rule in a pre-computed table that explicitly lists the first matching rule for each element of the cross-product. The size of this table is the product of the numbers of recognized prefixes/ranges for the individual fields. Due to its memory requirements this method is not feasible for large rule sets.

For the example rule set the distinct per-field values are:
  Dest IP:   M, TI, Net, *          (4)
  Src IP:    S, TO, Net, *          (4)
  Dest Port: 25, 53, 23, 123, *     (5)
  Src Port:  123, *                 (2)
  Proto:     UDP, TCP, *            (3)

The cross-product table has 4·4·5·2·3 = 480 entries, e.g.:
    0: (M,S,25,123,UDP) → R1
    1: (M,S,25,123,TCP) → R1
    2: (M,S,25,123,*)   → R1
  ...
  478: (*,*,*,*,TCP)    → R8
  479: (*,*,*,*,*)      → R8

The index of an entry follows from the per-field indices: (*,*,*,*,TCP) has index 3·120 + 3·30 + 4·6 + 1·3 + 1 = 478.
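The table indexing behind these numbers can be sketched as mixed-radix arithmetic. The value orderings are taken from the slide; the index of (*,*,*,*,TCP) works out to 478:

```python
# Distinct per-field values, in table order
fields = [["M", "TI", "Net", "*"],         # dest IP   (4)
          ["S", "TO", "Net", "*"],         # src IP    (4)
          ["25", "53", "23", "123", "*"],  # dest port (5)
          ["123", "*"],                    # src port  (2)
          ["UDP", "TCP", "*"]]             # protocol  (3)

def cross_product_index(values):
    """Mixed-radix index into the pre-computed rule table."""
    idx = 0
    for field, v in zip(fields, values):
        idx = idx * len(field) + field.index(v)
    return idx

table_size = 1
for f in fields:
    table_size *= len(f)

print(table_size)                                           # 480
print(cross_product_index(("*", "*", "*", "*", "TCP")))     # 478
print(cross_product_index(("M", "S", "25", "123", "UDP")))  # 0
```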
Equivalenced cross-producting (a.k.a. recursive flow classification, or RFC) combines the results of the per-field longest matching prefix operations two by two. The pairs of values are grouped into equivalence classes, and in general there are far fewer equivalence classes than pairs of values. This yields significant memory savings compared to simple cross-producting. The algorithm provides fast packet classification, but compared to other algorithms its memory requirements are relatively large (though feasible in some settings).

Combining the Dest IP and Src IP fields of the example rule set:

    Pair     Rule bitmap  Class
  0 M,S      11110011     C1
  1 M,TO     11010011     C2
  2 M,Net    11010111     C3
  3 M,*      11010011     C2
  4 TI,S     00000011     C4
  5 TI,TO    00001011     C5
  6 TI,Net   00000111     C6
  7 TI,*     00000011     C4
  8 Net,S    00000011     C4
  9 Net,TO   00000011     C4
 10 Net,Net  00000111     C6
 11 Net,*    00000011     C4
 12 *,S      00000001     C7
 13 *,TO     00000001     C7
 14 *,Net    00000100     C8
 15 *,*      00000001     C7

16 entries, 8 distinct classes. The class identifiers are then combined further with the Dest Port, Src Port, and Proto classes until a final result is obtained.
Decision tree approaches

- At each node of the tree, test a bit in a field or perform a range test
  - Large fan-out leads to shallow trees and fast classification
- Leaves contain a few rules that are traversed linearly
- Interior nodes may also contain rules that match
- Tests may look at bits from multiple fields
- A rule may appear in multiple nodes of the decision tree, which can increase memory usage
- The tree is built using heuristics that pick fields to compare on so that the remaining rules are divided relatively evenly among the descendants
- Fast and compact on the rule sets used today
Papers on packet classification

- G. Varghese, "Network Algorithmics ...", chapter 12
- V. Srinivasan, G. Varghese, S. Suri, M. Waldvogel, "Fast and Scalable Layer Four Switching", ACM SIGCOMM, Sep. 1998
- F. Baboescu, S. Singh, G. Varghese, "Packet classification for core routers: Is there an alternative to CAMs?", IEEE Infocom, 2003
- P. Gupta, N. McKeown, "Packet classification on multiple fields", ACM SIGCOMM, 1999
- T. Woo, "A modular approach to packet classification: Algorithms and results", IEEE Infocom, 2000
- S. Singh, F. Baboescu, G. Varghese, "Packet classification using multidimensional cutting", ACM SIGCOMM, 2003
Overview

- Longest matching prefix
- Classification on multiple fields
- Signature matching
  - String matching
  - Regular expression matching with DFAs and D2FAs
Signature matching

- Used in intrusion prevention/detection, application classification, load balancing
- Guard: a byte string or a regular expression
- Action: drop the packet, log an alert, set a priority, direct to a specific server
- Input: a byte string from the payload of the packet(s)
  - Hence the name "deep packet inspection"
- Output: the positions at which various signatures match, or the identifier of the "highest priority" signature that matches
- Size of problem: hundreds of signatures per protocol
String matching

- The most widely used early form of deep packet inspection, but the more expressive regular expressions have by now superseded strings
  - Still used as a pre-filter to more expensive matching operations by the popular open-source IDS/IPS Snort
- Matching multiple strings is a well-studied problem
  - A. Aho, M. Corasick, "Efficient string matching: An aid to bibliographic search", Communications of the ACM, June 1975
  - Many hardware-based solutions published in the last decade
  - Matching time is independent of the number of strings; memory requirements are proportional to the sum of their sizes
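The Aho-Corasick construction cited above can be sketched compactly: a trie (the goto function) plus BFS-built failure links. A minimal illustration with assumed names, not a tuned implementation:

```python
from collections import deque

def build_automaton(patterns):
    """Dict-based goto function, failure links, and output sets."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                 # build the trie
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    queue = deque(goto[0].values())      # depth-1 states: fail = root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]       # inherit matches ending here
    return goto, fail, out

def search(text, goto, fail, out):
    """One pass over the text; yields (end_index, pattern) matches."""
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            yield i, pat

matches = set(search("ushers", *build_automaton(["he", "she", "his", "hers"])))
print(sorted(matches))  # [(3, 'he'), (3, 'she'), (5, 'hers')]
```

The scan makes at most two pointer moves per input byte on average, which is what makes the matching time independent of the number of strings.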
Regular expression matching

- Deterministic and non-deterministic finite automata (DFAs and NFAs) can match regular expressions
  - NFAs are more compact, but require backtracking or keeping track of sets of states during matching
  - Both representations are used in hardware and software solutions, but only DFA-based solutions can guarantee throughput in software
- DFAs have a state space explosion problem
  - From DFAs recognizing individual signatures we can build a DFA that recognizes the entire signature set in a single pass
  - The size of the combined DFA is much larger than the sum of the sizes of the DFAs recognizing the individual signatures
  - Multiple combined DFAs are used to match a signature set

S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, "Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection", ACM SIGCOMM, September 2006
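A table-driven DFA scan can be sketched for a single string signature. This is a hypothetical toy with dense per-state tables, as a hardware DFA would store them; real signature sets compile regular expressions and combine many such DFAs:

```python
def build_dfa(pattern, alphabet):
    """Dense transition table for '.*pattern': one entry per state
    and input symbol. State s means s pattern characters matched."""
    n = len(pattern)
    dfa = [dict.fromkeys(alphabet, 0) for _ in range(n + 1)]
    for s in range(n):
        for c in alphabet:
            if c == pattern[s]:
                dfa[s][c] = s + 1
            else:
                # longest suffix of (matched text + c) that is a prefix
                t = (pattern[:s] + c)[1:]
                while t and not pattern.startswith(t):
                    t = t[1:]
                dfa[s][c] = len(t)
    for c in alphabet:
        dfa[n][c] = n        # accepting state is sticky
    return dfa

def scan(payload, dfa, accept):
    """Exactly one table lookup per payload byte."""
    s = 0
    for ch in payload:
        s = dfa[s][ch]
    return s == accept

dfa = build_dfa("abab", "ab")
print(scan("aabab", dfa, 4))  # True
print(scan("aabba", dfa, 4))  # False
```

The one-lookup-per-byte loop is what gives DFA-based software matchers their throughput guarantee; the cost is the size of the dense tables, which the D2FA below attacks.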
Delayed Input DFA (D2FA)

If the "current state" variable meets an acceptance condition (e.g. the state identifier is larger than a given threshold), the automaton raises an alert.

D2FAs build on the observation that for many pairs of states the transition tables are very similar, so it is enough to store the differences. The lookup algorithm may need to follow multiple default transitions until it finds a state that explicitly stores a pointer to the next state it needs to transition to. Since this is a throughput concern, the algorithm for constructing D2FAs lets the user set a limit on the length of the maximum default path.

[Figure: a DFA built from a set of regular expressions, and the corresponding D2FA in which most per-state transition tables are replaced by default transitions between similar states; only the differing transitions are stored explicitly. Example input: ...410052...]

            D2FAs, no bound on default path length   D2FAs, d.p.l. <= 4
Rule set    Avg. d.p.l.   Max d.p.l.   Memory        Memory
Cisco590    18.32         57           0.80%         1.56%
Cisco103    16.65         54           0.98%         1.54%
Cisco7      19.61         61           2.58%         3.31%
Linux56      7.68         30           1.64%         1.87%
Linux10      5.14         20           8.59%         9.08%
Snort11      5.86          9           1.57%         1.66%
Bro648       6.45         17           0.45%         0.51%

The memory columns report the ratio between the number of transitions used by the D2FA and by the corresponding DFA.
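Following default transitions at lookup time can be sketched with a hypothetical toy automaton. States and transitions are invented for illustration; each state stores a sparse table of explicit transitions plus one default transition:

```python
# Toy D2FA over the alphabet {'a','b','c'}: (explicit transitions,
# default state). The root stores all of its transitions explicitly.
d2fa = {
    0: ({"a": 1, "b": 2, "c": 0}, None),
    1: ({"c": 3}, 0),   # differs from state 0 only on input 'c'
    2: ({"a": 4}, 0),
    3: ({}, 0),
    4: ({}, 1),
}

def next_state(state, ch, max_dpl=4):
    """Follow default transitions until an explicit entry for ch is
    found; construction bounds the default path length (d.p.l.)."""
    hops = 0
    while ch not in d2fa[state][0]:
        state = d2fa[state][1]
        hops += 1
        assert hops <= max_dpl, "default path exceeds the bound"
    return d2fa[state][0][ch]

def run(payload):
    state = 0
    for ch in payload:
        state = next_state(state, ch)
    return state

print(run("aca"))  # 1: 0 -a-> 1 -c-> 3, then 'a' resolved via default
```

Each default hop costs an extra memory access, which is why bounding the default path length trades a little memory (the last table column above) for guaranteed lookup cost.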
Conclusions

- Networking devices implement more and more complex data plane processing to better control traffic
- The algorithms and data structures used have a big performance impact
- Often the set of rules to be matched against has a specific structure
  - Algorithms exploiting this structure may give good performance even if it is impossible to find an algorithm that performs well on all possible rule sets

That's all folks!