Transcript Document

Efficient Regular Expression Evaluation:
Theory to Practice
Michela Becchi and Patrick Crowley
ANCS’08
Motivation

 Size and complexity of rule-sets have increased in recent years
» Snort, as of November 2007
– 8,536 rules, 5,549 Perl Compatible Regular Expressions
 99% with character ranges ([c1-ck], \s, \w, ...)
 16.3% with dot-star terms (.*, [^c1..ck]*)
 44% with counting constraints (.{n,m}, [^c1..ck]{n,m})

 Several proposals to accelerate regular expression matching
» FPGA
» Memory-centric architectures
Objectives
 Can we converge distinct algorithmic techniques into a single proposal, also for large data-sets?

 Can we apply techniques intended for memory-centric architectures to FPGAs as well?

Provide a tool that allows anybody to implement a high-throughput DPI system on the architecture of their choice
Target Architectures
[Figure: a regex-matching engine can target, in order of available parallelism, memory-centric architectures (general-purpose processors, network processors, FPGA/ASIC + memory) and FPGA logic]
Challenges
[Figure: DFAs target memory-centric architectures (general-purpose processors, network processors, FPGA/ASIC + memory), where the challenges are memory space and memory bandwidth; NFAs target FPGA logic, where the challenges are logic cell utilization and clock frequency]
D2FA: default transition compression
 Observations:
» DFA state: set of |∑| next-state pointers
» Transition redundancy
 Idea:
» Differential state representation through use of non-consuming default transitions
[Figure: example DFA fragment over states s1-s6; states that share most outgoing transitions keep only their differing transitions plus a non-consuming default transition, forming a default path. In general, characters not labeled on a state (∑ minus the labeled c1..c4) follow its default transition.]
D2FA algorithms

Problem: set default transitions so as to:
1. Maximize memory compression
2. Minimize memory bandwidth overhead
 [Kumar et al, SIGCOMM'06]
» Bound dpMAX on max default path length
» O(dpMAX+1) memory accesses per input char
» Better compression for higher dpMAX

[Becchi et al, ANCS’07]
» Only backward-directed default transitions (skipping k levels)
» Amortized memory bandwidth O(((k+1)/k)·N) on N input chars
» Depth-first traversal → applicable at DFA creation
[Kumar et al, SIGCOMM'06]:              vs.  [Becchi et al, ANCS'07]:
  Memory bandwidth = O((dpMAX+1)·N)            Memory bandwidth = O(((k+1)/k)·N)
  Time complexity  = O(n² log n)               Time complexity  = O(n²)
  Space complexity = O(n²)                     Space complexity = O(n)
Compression w/ k=1 ≈ compression w/ dpMAX=∞
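
As a concrete illustration of the default-path mechanism, here is a minimal Python sketch (illustrative, not the authors' implementation) of a D2FA lookup: each state stores only its non-redundant labeled transitions plus one non-consuming default transition, and the lookup walks default transitions until a labeled transition on the current character is found.

def d2fa_step(labeled, default, state, ch):
    """Follow non-consuming default transitions until 'state' has a labeled
    transition on 'ch', then consume ch. Assumes a well-formed D2FA, i.e.
    every character is resolvable somewhere along the default path."""
    while ch not in labeled[state]:
        state = default[state]          # no input consumed
    return labeled[state][ch]           # one consuming transition

def d2fa_match(labeled, default, start, accepting, text):
    state = start
    for ch in text:
        state = d2fa_step(labeled, default, state, ch)
        if state in accepting:
            return True
    return False

# Toy D2FA over alphabet {a, b}: state 0 keeps a full row, states 1 and 2
# store only the transitions that differ from state 0 (their default target).
labeled = {0: {'a': 1, 'b': 0}, 1: {'a': 2}, 2: {}}
default = {1: 0, 2: 0}
print(d2fa_match(labeled, default, 0, {2}, "aab"))   # True (state 2 reached on "aa")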
DFA alphabet reduction
Effective for:
 Ignore-case regex
 Char-ranges
 Never-used chars

[Figure: example DFA with character-range transitions ([a-z], [B-Z], [0-9], ...) rewritten over a reduced alphabet {0..4}; an alphabet translation table maps each original character to its symbol class]
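
A minimal sketch of the clustering idea behind alphabet reduction, assuming a DFA with a complete transition table: two characters can share a class if they lead to the same next state from every state. Function and variable names are illustrative.

def reduce_dfa_alphabet(delta, alphabet):
    """delta: dict state -> {char: next_state} (complete transition table).
    Returns (tx_table, num_classes); tx_table is the alphabet translation
    table mapping each character to its class id."""
    states = sorted(delta)
    classes = {}      # per-character behavior signature -> class id
    tx_table = {}
    for ch in alphabet:
        sig = tuple(delta[s][ch] for s in states)   # how ch behaves in every state
        tx_table[ch] = classes.setdefault(sig, len(classes))
    return tx_table, len(classes)

The reduced DFA then indexes its transition rows by class id, so case-insensitive characters, range members and never-used characters collapse onto a few classes.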
Multiple-stride DFAs

[Brodie et al, ISCA 2006]
 Idea:
» Process "stride" input characters at a time
[Figure: a stride-1 DFA and the corresponding stride-2 DFA; character pairs such as ab, bc, da, dd and grouped pairs such as [a-f]a, [a-cef]a, [b-f]b label the stride-2 transitions]
Observations:
» Mechanism used on small DFAs (1-2 regex)
» No distinct accepting state handling
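
Below is a minimal sketch of the stride-doubling step, under the simplifying assumption of a complete stride-1 transition table; matches that would be reported after the first character of a pair are only recorded here, whereas the following slides handle them by duplicating accepting states.

from itertools import product

def double_stride(delta, accepting, alphabet):
    """delta: dict state -> {symbol: next_state}. Builds a DFA over symbol
    pairs: the transition on (c1, c2) composes two stride-1 steps."""
    delta2, mid_pair_accepts = {}, set()
    for s in delta:
        delta2[s] = {}
        for c1, c2 in product(alphabet, repeat=2):
            mid = delta[s][c1]
            if mid in accepting:
                mid_pair_accepts.add((s, c1))    # match ends mid-pair
            delta2[s][(c1, c2)] = delta[mid][c2]
    return delta2, mid_pair_accepts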
Multiple stride + alphabet reduction

Stride s → alphabet ∑^s
» ∑ = ASCII alphabet → |∑^2| = 256^2 = 65,536; |∑^4| = 256^4 ≈ 4,294M

Effective alphabet much smaller
» Char grouping: [a-cef]a, [b-f]b
[Figure: the stride-1 DFA and its stride-2 DFA again; grouped pair labels such as [a-f]a, [a-cef]a, [b-f]b show that the effective stride-2 alphabet is much smaller than ∑^2]
Alphabet reduction may be necessary to make stride doubling
feasible on large DFAs
[Pipeline: DFA → alphabet reduction (TxTable1) → stride doubling + alphabet reduction (TxTable2,1) → 2-DFA → stride doubling + alphabet reduction (TxTable4,2,1) → 4-DFA]
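
A minimal sketch of how the cascaded translation tables named in the pipeline above might be used at match time; the TxTable layout shown here is an assumption, since the slide only names the tables.

def stride2_step(delta2, tx1, tx21, state, c1, c2):
    """tx1: raw character -> stride-1 class (TxTable1);
    tx21: (class, class) -> stride-2 symbol (TxTable2,1);
    delta2: 2-DFA transition rows indexed by stride-2 symbols."""
    sym2 = tx21[(tx1[c1], tx1[c2])]   # two small table lookups per character pair
    return delta2[state][sym2]        # one state transition per two input characters

Cascading keeps each table small even though there are |∑^2| = 65,536 raw character pairs.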
Multiple stride + default transitions

Compression
» Default transitions eliminate transition redundancy
» In multiple-stride DFAs:
– # of states does not substantially change
– # of transitions per state increases exponentially with the stride (|∑|^stride)
→ Fraction of distinct over total transitions decreases: increased potential for compression!

Accepting state handling
[Figure: stride-1 DFA and stride-2 DFA for the same example; the stride-2 DFA contains a duplicated accepting state (0/1) for matches that end on the first character of an input pair]
» Duplicated states have the same outgoing transitions as the original states, but different depth
– A default transition will remove all outgoing transitions from the new accepting states
Multiple stride + default transitions (cont’d)

Problem:
» For large ∑ and stride, the uncompressed DFA may be infeasible
– Out of memory when generating a 2K-node, stride-4 DFA on a Linux machine w/ 4GB memory

Solution:
» Perform default transition compression during DFA creation
– Use the [Becchi et al, ANCS'07] compression algorithm
 In the situation above, only 10% of the memory is used
[Pipeline: DFA → alphabet reduction (TxTable1) → stride doubling + compression + alphabet reduction (TxTable2,1) → compressed 2-DFA → stride doubling + compression + alphabet reduction (TxTable4,2,1) → compressed 4-DFA]
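
A minimal sketch of compressing while the stride-2 DFA is being built, so the full per-state row over ∑^2 is never materialized. How each state's default target is chosen (e.g., the backward-directed scheme cited earlier) is assumed to be decided elsewhere; the code is illustrative rather than the authors' algorithm.

from itertools import product

def build_compressed_row(delta, s, default_of_s, alphabet):
    """Compute the stride-2 row of state s on the fly, storing only the pairs
    whose target differs from the default state's stride-2 target."""
    row = {}
    for c1, c2 in product(alphabet, repeat=2):
        target = delta[delta[s][c1]][c2]                      # stride-2 target of s
        if default_of_s is None:
            row[(c1, c2)] = target                            # root of the default tree keeps all
        elif target != delta[delta[default_of_s][c1]][c2]:
            row[(c1, c2)] = target                            # store differences only
    return row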
Putting everything together…
Evaluation flow (1-22 regex, 48-1,940 states):

DFA → alphabet reduction (|∑'| = 25-44)
  → default transition compression → Compressed DFA
      (96.3-98.5% transitions removed, avg 1-2 labeled tx/state)
  → stride-2 transformation → Stride-2 DFA (|∑'| = 53-470) → alphabet reduction
      → default transition compression → Compressed Stride-2 DFA
      (97.9-99.5% transitions removed, avg 3-5 labeled tx/state)

- Same memory bandwidth requirement
- Initial size = 40x-80x final size
NFA
[Figure: NFA for the example rule-set: 1. ab+cd, 2. ab+ce, 3. ab+c.*f, 4. b[d-f]a, 5. bdc]
Multiple stride + alphabet reduction

Stride doubling
 Avoid new state creation
 Keep multiple transitions on the same symbol separated
[Figure: example NFA and its stride-2 counterpart (2-NFA); pair labels such as ab, cd, ce and dot-pairs such as .a, .c, d. keep transitions on the same symbol separated without creating new states]
 Alphabet reduction:
» Clustering-based algorithm as for the DFA, but sets of target states are compared
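
A minimal sketch of the NFA variant of the clustering step: the grouping is the same as for the DFA, except that the per-state behavior being compared is a set of target states. Names are illustrative.

def reduce_nfa_alphabet(delta, alphabet):
    """delta: dict state -> {char: set of next states} (missing char = no transition).
    Two characters share a class iff they reach the same target-state set
    from every NFA state."""
    states = sorted(delta)
    classes, tx_table = {}, {}
    for ch in alphabet:
        sig = tuple(frozenset(delta[s].get(ch, ())) for s in states)
        tx_table[ch] = classes.setdefault(sig, len(classes))
    return tx_table, len(classes)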
FPGA implementation
[Figure: FPGA block diagram: INPUT (k·log|∑| bits) → Alphabet Tx (log|∑'| bits) → Decoder (|∑'| lines) → NFA → MATCH; control signals INIT and CLK]
One-hot encoding [Sidhu & Prasanna]
 Quine-McCluskey-like minimization scheme
 + logic reduction schemes
[Figure: minimization examples: incoming transitions are merged (e.g., S2 on ci and S3 on ck feeding S1 become a single term, optionally expressed via the complement set ∑-{ci,ck}); character classes collapse (e.g., ∑-{bBcCdD}={aA}); stride-2 conditions become boolean terms such as (c1=b OR c1=B) AND NOT (c2=a OR c2=A)]
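
A software analogue (not the hardware description itself) of the one-hot encoding: each NFA state is one bit, and the next active-state vector is the OR over all transitions whose source bit is set and whose character class matches the current input. Keeping the start state active every cycle models unanchored matching and is an assumption about the examples above.

def onehot_step(transitions, active, ch):
    """transitions: list of (src_bit, char_set, dst_bit);
    active: bitmask of currently active NFA states (one bit per state)."""
    nxt = 0
    for src, chars, dst in transitions:
        if (active >> src) & 1 and ch in chars:   # source flip-flop set AND character matches
            nxt |= 1 << dst
    return nxt

def onehot_scan(transitions, start_mask, accept_mask, text):
    active = start_mask
    for ch in text:
        active = onehot_step(transitions, active, ch) | start_mask  # keep start state active
        if active & accept_mask:
            return True
    return False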
FPGA Results - throughput
[Chart: throughput (Gbps, 0-8) for rule-sets any_99, mail_79, http_406 under stride 1 w/ full alphabet, stride 1 w/ reduced alphabet, and stride 2 w/ reduced alphabet]
FPGA Results – logic utilization
[Chart: # slices for rule-sets any_99, mail_79, http_406 under stride 1 w/ full alphabet, stride 1 w/ reduced alphabet, and stride 2 w/ reduced alphabet; annotations report, per rule-set, the NFA state count and reduced stride-1/stride-2 alphabet sizes: (#s=7,864, ∑1=64, ∑2=2,206), (#s=2,086, ∑1=78, ∑2=1,969), (#s=2,147, ∑1=68, ∑2=1,640)]

Utilization:
» 8-46% on XC5VLX50 device (7,400 slices)
» XC5VLX330 device has 51,840 slices
ASIC – projected results
Content addressing w/ 64-bit words:
- 98% of states compressed w/ stride 1
- 82% of states compressed w/ stride 2
Regex partitioning into multiple DFAs (rows any1-any3)

                     Stride = 1                                     Stride = 2
Rule-set      |Σ|   #states   Memory footprint               |Σ|    #states   Memory footprint
                               compressed / full states                        compressed / full states
k-NFA  any    78    2,086     -          / -                 1,969  2,091     -          / -
k-DFA  any1   59    23,846    505 KB     / 200 KB            850    28,223    356 KB     / 32 MB
       any2   45    86,977    2.9 MB     / 55 KB             579    102,940   1.27 MB    / 81 MB
       any3   60    14,084    299 MB     / 48 KB             627    19,344    244 KB     / 16 MB
Throughput (SRAM @ 500 MHz):
 2-4 Gbps for stride 1
 4-8 Gbps for stride 2
Alternative representation: decoders in ASIC or instruction memory
Conclusion

Algorithm:
» Combination of default transition compression, alphabet reduction and stride multiplying on potentially large DFAs
» Extension of alphabet reduction and stride multiplying to NFAs
 FPGA implementation:
» Use of one-hot encoding w/ incremental improvement schemes
» Logic minimization scheme for alphabet reduction & decoding
 Additional aspects:
» Multiple flow handling: FPGA vs. memory-centric architectures
» Design improvements tailored to specific architectures and data-sets:
– Clustering into smaller NFAs and DFAs to allow smaller alphabets w/ larger strides
Thank you!

Questions?
http://regex.wustl.edu