A High Throughput String Matching Architecture for Intrusion Detection and Prevention Lin Tan U of Illinois, Urbana Champaign Tim Sherwood UC, Santa Barbara.


A High Throughput
String Matching Architecture
for Intrusion Detection and Prevention
Lin Tan
U of Illinois, Urbana Champaign
Tim Sherwood
UC, Santa Barbara
Outline
• Why String Matching
– Matching against multiple strings
• The Aho-Corasick Algorithm
– The Devil in the Constants
• A Bit-Split Algorithm
• Hardware Design and Analysis
• Conclusions
To Protect and Serve
• Our machines are constantly under attack
• We cannot rely on end users; we need networks
that actively defend themselves.
IDS/IPS are promising ways of providing protection
Market for such systems: $918.9 million by the end of 2007.
Snort: a widely accepted open source IDS
Protecting current and next generation networks requires
the protection system to operate at 10 to 40 Gb/s.
Our Contributions
• String Matching Architecture:
– 0.4MB and 10Gbps for the Snort rule set (>10,000 characters)
• Bit-Split String Matching Algorithm
– Reduces out edges from 256 to 2.
• Performance/area beats the best techniques
we examined by a factor of 10 or more.
Scanning for Intrusions
CodeRed worm:
web flow established
uricontent with “/root.exe”
Software
Scan
IDS
Traffic In
Traffic Out
Most IDS define a set of rules.
A string defines a suspicious transmission.
We are not building a full IDS, rather building the
primitives from which full systems can be built
Multiple String Matching
• The multiple string matching algorithm:
– Input: A set of strings/patterns S, and a buffer b
– Output: Every occurrence of an element of S in b
A string can be anywhere in the payload of a packet.
Input:
A B D FC A B
Strings:
A B
CA
A B
– Extra constraint: b is really a stream
• How to implement:
Option 1) search for each string independently
Option 2) combine strings together and search all at once
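A minimal sketch of Option 1, to make the baseline concrete: every string is searched for independently, so the buffer is scanned once per string. The function name `find_all` and the sample buffer are illustrative assumptions, not part of the talk; the pattern set is the one used in the later example slides.

```python
def find_all(patterns, buffer):
    """Option 1: search for each pattern on its own.

    Returns sorted (end_index, pattern) pairs, one per occurrence.
    The buffer is rescanned for every pattern, so the cost grows
    with the number of patterns.
    """
    hits = []
    for p in patterns:
        start = buffer.find(p)
        while start != -1:
            hits.append((start + len(p) - 1, p))
            start = buffer.find(p, start + 1)  # allow overlapping matches
    return sorted(hits)


matches = find_all(["he", "she", "his", "hers"], "ushers")
print(matches)
```

Option 2 avoids the per-pattern rescans by compiling all strings into one state machine, which is what the rest of the talk builds on.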
Why hardware
• Snort: >1,000 rules, growing at 1 rule/day or more
• Active research into automated rule building
• Strings are not limited to [a-z]+
• We need a high speed string matching technique
with stringent worst case performance.
• Many algorithms target average case
performance. Aho-Corasick scans the input once and
outputs all matches, but its state machine is too big to fit on-chip.
Outline
• Why String Matching
– Matching against multiple strings
• The Aho-Corasick Algorithm
– The Devil in the Constants
• A Bit-Split Algorithm
• Hardware Design and Analysis
• Conclusions
The Aho-Corasick Algorithm
• Given a finite set P of patterns, build a
deterministic finite automaton G accepting
the set of all patterns in P.
An AC Automaton Example
• Example: P = {he, she, his, hers}
[Figure: the AC state machine for P, with states 0 to 9. Legend: initial state, state, accepting state, transition function.]
• The construction: linear time.
• The search for all patterns in P: linear time.
(Edges pointing back to State 0 are not shown).
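The automaton for P = {he, she, his, hers} can be sketched in a few lines. This is a textbook-style construction written for illustration, not the authors' code: the names `build_ac_dfa` and `search` are assumptions, states are numbered in insertion order, and every state gets a full 256-entry row so that the edges back to State 0 (omitted in the figure) are explicit.

```python
from collections import deque


def build_ac_dfa(patterns):
    """Full Aho-Corasick DFA over bytes.

    delta[state][byte] is the next state; out[state] is the set of
    indices of the patterns that end when that state is reached.
    """
    goto, out = [{}], [set()]
    for idx, pat in enumerate(patterns):          # build the trie
        s = 0
        for b in pat:
            if b not in goto[s]:
                goto[s][b] = len(goto)
                goto.append({})
                out.append(set())
            s = goto[s][b]
        out[s].add(idx)

    fail = [0] * len(goto)
    delta = [[0] * 256 for _ in goto]
    queue = deque()
    for b, t in goto[0].items():
        delta[0][b] = t
        queue.append(t)
    while queue:                                  # breadth-first fill
        s = queue.popleft()
        out[s] |= out[fail[s]]                    # inherit suffix matches
        for b in range(256):
            t = goto[s].get(b)
            if t is None:
                delta[s][b] = delta[fail[s]][b]   # follow the failure link
            else:
                delta[s][b] = t
                fail[t] = delta[fail[s]][b]
                queue.append(t)
    return delta, out


def search(delta, out, text):
    """Scan text once; return (end_index, pattern_index) for every match."""
    state, hits = 0, []
    for i, b in enumerate(text):
        state = delta[state][b]
        hits.extend((i, idx) for idx in sorted(out[state]))
    return hits


patterns = [b"he", b"she", b"his", b"hers"]
delta, out = build_ac_dfa(patterns)
hits = search(delta, out, b"ushers")
print(len(delta), hits)
```

Each input byte costs exactly one table lookup regardless of how many patterns are loaded, which is the worst-case guarantee the talk is after; the price is the 256-entry row per state.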
Linear Time: So what’s the problem
• How to implement it on chip?
[Figure: state table with 16,384 rows; each row holds 256 next state pointers (indexed 0 to 255) of 14 bits each.]
• Problem: Size too big to be on-chip
– ~ 10,000 nodes
– 256 out edges per node
– Requires 16,384*256*14 = ~10MB
• Solution: partition into small state machines
– Fewer strings per machine
– Fewer out edges per machine
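The size estimate can be checked with a few lines of arithmetic. The sketch below counts only the next state pointers, using the three numbers from the slide; the variable names are illustrative. The slide's ~10MB is the authors' figure, and whatever it includes beyond the pointers (for example per-state match information) is not modeled here.

```python
STATES = 16_384       # table rows, from the slide
OUT_EDGES = 256       # one next state pointer per possible byte
POINTER_BITS = 14     # 2**14 == 16,384, enough to name any state

pointer_bits_total = STATES * OUT_EDGES * POINTER_BITS
pointer_bytes = pointer_bits_total // 8
pointer_mib = pointer_bytes / 2**20

print(pointer_bits_total, pointer_bytes, pointer_mib)

# With only 2 out edges per state (the bit-split target), the same
# number of states would need 128x fewer pointers.
reduction = OUT_EDGES // 2
```

The pointers alone come to several megabytes, the same order of magnitude as the slide's estimate and far beyond a small on-chip memory.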
Outline
• Why String Matching
– Matching against multiple strings
• The Aho-Corasick Algorithm
– The Devil in the Constants
• A Bit-Split Algorithm
• Hardware Design and Analysis
• Conclusions
Our Main Idea: Bit-Split
• Partition rules (P) into smaller sets (P0 to Pn)
• Build AC state-machine for each subset
• For each DFA Pi, rip state-machine apart into
8 tiny state-machines (Bi0 through Bi7)
• Each tiny machine tracks 1 bit of the 8 bit
encoding of an input character
– Only if all the different B machines agree can
there actually be a match
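One way to sketch the splitting step is a subset construction over a single bit position: a state of the tiny machine is the set of AC states that are still possible given only that bit of every input character. The code below is an illustration under stated assumptions, not the authors' implementation: `build_ac_dfa` and `bit_split` are hypothetical names, and bit 0 is taken to be the most significant bit.

```python
from collections import deque


def build_ac_dfa(patterns):
    """Full Aho-Corasick DFA: delta[state][byte], out[state] = pattern indices."""
    goto, out = [{}], [set()]
    for idx, pat in enumerate(patterns):
        s = 0
        for b in pat:
            if b not in goto[s]:
                goto[s][b] = len(goto)
                goto.append({})
                out.append(set())
            s = goto[s][b]
        out[s].add(idx)
    fail = [0] * len(goto)
    delta = [[0] * 256 for _ in goto]
    queue = deque()
    for b, t in goto[0].items():
        delta[0][b] = t
        queue.append(t)
    while queue:
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for b in range(256):
            t = goto[s].get(b)
            if t is None:
                delta[s][b] = delta[fail[s]][b]
            else:
                delta[s][b] = t
                fail[t] = delta[fail[s]][b]
                queue.append(t)
    return delta, out


def bit_split(delta, out, bit):
    """Binary machine for one bit position (bit 0 = most significant bit).

    Returns (trans, pmv, members): trans[b][v] is the next binary state on
    bit value v, pmv[b] the patterns that may have matched in state b, and
    members[b] the set of AC states that binary state b stands for.
    """
    shift = 7 - bit
    groups = ([c for c in range(256) if not (c >> shift) & 1],
              [c for c in range(256) if (c >> shift) & 1])
    start = frozenset([0])
    ids, members, trans = {start: 0}, [start], []
    i = 0
    while i < len(members):                  # worklist of discovered sets
        row = []
        for chars in groups:                 # bit value 0, then bit value 1
            nxt = frozenset(delta[s][c] for s in members[i] for c in chars)
            if nxt not in ids:
                ids[nxt] = len(members)
                members.append(nxt)
            row.append(ids[nxt])
        trans.append(row)
        i += 1
    pmv = [set().union(*(out[s] for s in m)) for m in members]
    return trans, pmv, members


patterns = [b"he", b"she", b"his", b"hers"]
delta, out = build_ac_dfa(patterns)
trans3, pmv3, members3 = bit_split(delta, out, 3)
trans4, pmv4, members4 = bit_split(delta, out, 4)
print(len(delta), len(members3), len(members4))
```

Every tiny machine has exactly 2 out edges per state, so its table is 128 times narrower than a 256-way AC row.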
Binary Encoding
P0 = { he, she, his, hers }
An example of Bit-Split
P0 = { he, she, his, hers }
[Figure: the AC state machine P0 next to the bit-split machine B03. The characters are encoded in binary (for example h = 0110 1000, s = 0111 0011), and B03 follows only bit 3 of each character. Each B03 state is labeled with the set of P0 states it stands for: b0 {0}, b1 {0,1}, b2 {0,3}, b3 {0,1,2,6}, b4 {0,1,4}, b5 {0,3,7,8}, b6 {0,1,2,5,6}, b7 {0,3,9}.]
(Edges pointing back to State 0 are not shown).
Compact State Set
P0 = { he, she, his, hers }
[Figure: P0 and B03 again, with each B03 state now labeled only with the accepting P0 states it contains: b0 {}, b1 {}, b2 {}, b3 {2}, b4 {}, b5 {7}, b6 {2,5}, b7 {9}.]
(Edges pointing back to State 0 are not shown).
An example of Bit-Split
P0 = { he, she, his, hers }
[Figure: P0 with two of its bit-split machines, B03 and B04, each state labeled with its compact state set.]
(Edges pointing back to State 0 are not shown).
Nice Properties
• The number of states in Bij is rigorously
bounded by the number of states in Pi
• No exponential blowup in the number of states
• Linear construction time
• Possible to traverse multiple edges at a time
to multiply throughput
Matching on the example
[Figure: the AC state machine for P0 = { he, she, his, hers }.]
Input stream:
h x h e r s
Only scan the input stream once.
Matching on the example
[Figure: matching the input hxhe on P0, B03 and B04. B03 sees bit 3 of each character (0100) and B04 sees bit 4 (1110).]
How do you “combine” the results from the different state machines?
Only if all the state machines agree, is there actually a match.
How to Implement
• The AC state machine is equivalent to the 8
tiny state machines.
• The 8 tiny state machines can run
independently, which means in parallel
• Intersection done with bit-wise AND.
• Splitting into 8 one-bit machines is intuitive but not optimal
• How to build a system to implement this
algorithm?
– Our algorithm makes an on-chip implementation feasible
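The combination step can be sketched end to end: run the 8 tiny machines side by side, keep one partial match vector per state as a bit mask, and AND the 8 masks after every byte. This is an illustrative software model, not the hardware design: all function names are assumptions, bit 0 is taken as the most significant bit, and bit i of a mask stands for pattern i.

```python
from collections import deque


def build_ac_dfa(patterns):
    """Full Aho-Corasick DFA: delta[state][byte], out[state] = pattern indices."""
    goto, out = [{}], [set()]
    for idx, pat in enumerate(patterns):
        s = 0
        for b in pat:
            if b not in goto[s]:
                goto[s][b] = len(goto)
                goto.append({})
                out.append(set())
            s = goto[s][b]
        out[s].add(idx)
    fail = [0] * len(goto)
    delta = [[0] * 256 for _ in goto]
    queue = deque()
    for b, t in goto[0].items():
        delta[0][b] = t
        queue.append(t)
    while queue:
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for b in range(256):
            t = goto[s].get(b)
            if t is None:
                delta[s][b] = delta[fail[s]][b]
            else:
                delta[s][b] = t
                fail[t] = delta[fail[s]][b]
                queue.append(t)
    return delta, out


def bit_split(delta, out, bit):
    """Binary machine for one bit position; returns (trans, pmv_masks)."""
    shift = 7 - bit
    groups = ([c for c in range(256) if not (c >> shift) & 1],
              [c for c in range(256) if (c >> shift) & 1])
    start = frozenset([0])
    ids, members, trans = {start: 0}, [start], []
    i = 0
    while i < len(members):
        row = []
        for chars in groups:
            nxt = frozenset(delta[s][c] for s in members[i] for c in chars)
            if nxt not in ids:
                ids[nxt] = len(members)
                members.append(nxt)
            row.append(ids[nxt])
        trans.append(row)
        i += 1
    pmv = []
    for m in members:                      # partial match vector as a bit mask
        mask = 0
        for s in m:
            for idx in out[s]:
                mask |= 1 << idx
        pmv.append(mask)
    return trans, pmv


def scan(patterns, text):
    """Run the 8 binary machines in lockstep and AND their match vectors."""
    delta, out = build_ac_dfa(patterns)
    machines = [bit_split(delta, out, bit) for bit in range(8)]
    states = [0] * 8
    hits = []
    for i, byte in enumerate(text):
        agreed = (1 << len(patterns)) - 1  # all ones
        for bit, (trans, pmv) in enumerate(machines):
            states[bit] = trans[states[bit]][(byte >> (7 - bit)) & 1]
            agreed &= pmv[states[bit]]     # intersection by bit-wise AND
        hits.extend((i, idx) for idx in range(len(patterns)) if (agreed >> idx) & 1)
    return hits


patterns = [b"he", b"she", b"his", b"hers"]
result = scan(patterns, b"hxhers")
print(result)
```

A pattern survives the AND only if the most recent bits agree with it in every bit position, which is the same as the most recent bytes being equal to the pattern. The 8 inner iterations are independent of one another, so in hardware they can run in parallel.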
A Hardware Implementation
[Figure: the String Match Engine. Each byte from the payload is sent to Rule Modules 0 through N. Inside a rule module, a control block and Tiles 0 to 3 each take 2 bits from each byte (inputs [0:1], [2:3], [4:5], [6:7]); every tile outputs a 16-bit partial match vector, and these are combined into a 16-bit full match vector. A state machine tile stores entries 0 to 255, each with 4 next state pointers (8 bits each) and a partial match vector, together with an 8-bit current state, a decoder, a 4:1 mux, an output latch and a config data input. The rule module outputs together form the complete set of matches for all rules.]
• A rule module is equivalent to an AC state machine
• Rule modules, tiles are structurally equivalent
• All full match vectors are concatenated to indicate which
strings are matched
• One tile stores one tiny bit-split state machine
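Because each tile consumes 2 bits of every byte, a rule module needs 4 tiles rather than 8 one-bit machines. The slicing can be sketched as follows; the helper name `two_bit_fields` is hypothetical, and the assumption that the first tile receives the two most significant bits is mine, based on bit 0 being the most significant bit in the example encodings.

```python
def two_bit_fields(byte):
    """Split one byte into four 2-bit fields, most significant pair first.

    Under the assumption stated above, field k is the input of tile k,
    which then picks one of its 4 next state pointers per character.
    """
    return [(byte >> shift) & 0b11 for shift in (6, 4, 2, 0)]


fields = {ch: two_bit_fields(ord(ch)) for ch in "hxe"}
for ch, f in fields.items():
    print(ch, [format(v, "02b") for v in f])
```

Consuming 2 bits per step doubles the out edges of a tile from 2 to 4 while halving the number of machines that must agree.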
An efficient Implementation
[Figure: cycle-by-cycle trace of the input h, x, h, e (Cycles 0 to 3) through Tiles 0 to 3 of one rule module. Each tile receives two bits of every byte (h = 01 10 10 00, x = 01 11 10 00, e = 01 10 01 01), indexes its table (next state columns 00, 01, 10, 11 plus a partial match vector, PMV) and emits one PMV per cycle. The PMVs are combined into the full match vector, which reads 0000, 0000, 0000, 1000 for Cycles 0+P to 3+P.]
Performance of Hardware
Key Metric: Throughput*Character/Area
Related Work
• Software based
– Good for ~100Mb/s, common case
• FPGA-based
– Many schemes map rules down to a specialized circuit
• Near optimal utilization of hardware resources
– Implementing state machines on block-RAMs [Cho and Mangione-Smith]
– Concurrent to our work: mapping state machines to on-chip SRAM
[Aldwairi et al.]
– Bloom filters [Dharmapurikar et al.]
• Excellent filter in the common case
• TCAM-based
– Require all patterns to be shorter than or equal to the TCAM width
– Cutting long patterns: 2Gbps with 295KB TCAM [Yu et al.]
Conclusions
• New Tile-based Architecture
– 0.4MB and 10Gbps for the Snort rule set (>10,000 characters)
– Possible to be used for other applications, e.g. IP
lookups, packet classification.
• New Bit-split Algorithm:
– General purpose enough for many other applications, e.g.
spam detection, peephole optimization, IP lookups,
packet classification, etc.
– Feasible to implement on other tile-based
architectures.
Thank you! Questions?