Predictive Publish/Subscribe Matching

MIDDLEWARE SYSTEMS
RESEARCH GROUP
msrg.org
Predictive Publish/Subscribe Matching
Hans-Arno Jacobsen
Joint work with Vinod Muthusamy & Haifeng Liu
P-ToPSS project
University of Toronto
Little Anecdote
Date: Mon, 14 Sep … 10:37:26 -0400
From: "security@noc ... "
To: …
Cc: … CNS Security Admin
Subject: DDoS attack originating from …
/var/log/secure* & LogWatch
aaron/password from 211.43.206.53: …
abdullah/password from 211.43.206.53:
abraham/password from 211.43.206.53:
abram/password from 211.43.206.53:
account/password from 142.150.237.133:
account/password from 211.43.206.53:
adam/password from 211.43.206.53:
addison/password from 211.43.206.53:
aditya/password from 211.43.206.53:
admin/password from 142.150.237.133: 18 Time(s)
admin/password from 211.43.206.53: 18 Time(s)
administrator/password from 142.150.237.133: 3 Time(s)
administrator/password from 211.43.206.53: 3 Time(s)
jacobsen/password from 191.43.206.53: 2 Time(s)
And So It Happened:
Post-mortem forensics via events across different logs
[Figure: timeline of log entries for user John: denied; 211.43.206.53 successful, timestamp; logoff, timestamp; 190.35.106.46 successful, timestamp; password changed]
Had set user john with password john!
Predictive Analytics?
• Series of failed login attempts from same IP
– System is under attack
• Series of failed login attempts from same IP,
followed by successful login from that IP, followed
by immediate logoff
– System compromised
• Could we predict that the system is going to be
compromised soon with a certain probability,
after observing a partial match of the above
pattern?
– E.g.: "failed logins from IP, successful login from IP"
Compromised?
Events, Subscriptions &
Publish/Subscribe
• Here, events are
– Login attempts, logoff, system compromised
• Here, subscriptions are
– Specific patterns of interest
• Series of login attempts from same IP
• Series of login attempts from same IP, followed by logoff
• The publish/subscribe system is the abstraction
that matches subscriptions against the events
observed
• A match detects the event, e.g., system
compromised
Outline
• Predictive Toronto Publish/Subscribe System
• Event & subscription language model
• Matching with P-ToPSS
• Predicting with P-ToPSS
• Evaluation
P-ToPSS is Latest ToPSS Member
• For many applications, raising an alert after a
malicious activity has occurred is too late
– Credit card fraud (fraud committed)
– Network intrusion (system compromised)
– Problem determination (problem occurred)
– Root-cause analysis (system crashed, poor user experience)
• Capability to predict the probability that a given
subscription will match in the future is needed.
• P-ToPSS computes the probability that a
subscription will match based on the event history
and based on partial matches observed so far.
P-ToPSS Model
[Figure: match engine reporting a match, e.g., "cs1 is fully matched"]
Publish/Subscribe matching problem
• Find all matches
Publish/Subscribe prediction problem
• Find partial matches
• Determine subscriptions with
matching probability > threshold
Event Model
An event: e = {(a1,v1),(a2,v2), …(an,vn)}
Event stream: {e1, e2, … ek, …}
Events are ordered (system timestamps)
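A minimal sketch of this event model, assuming a hypothetical `Event` class and `make_event` helper (neither name comes from the slides): an event is a set of attribute-value pairs, and a stream is ordered by system timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One event: a system timestamp plus attribute-value pairs."""
    timestamp: float
    attributes: tuple  # tuple of (attribute, value) pairs

    def get(self, name, default=None):
        # Look up the value of one attribute, if present.
        return dict(self.attributes).get(name, default)


def make_event(timestamp, **attrs):
    # Hypothetical convenience constructor: keyword arguments become pairs.
    return Event(timestamp, tuple(sorted(attrs.items())))


def ordered_stream(events):
    # Events are ordered by their system timestamps.
    return sorted(events, key=lambda e: e.timestamp)


e_late = make_event(2.0, ip="211.43.206.53", login="success")
e_early = make_event(1.0, ip="211.43.206.53", login="denied")
stream = ordered_stream([e_late, e_early])
```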
Subscription Language Model
• Primitive subscriptions
– S = p1 ∧ p2 ∧ p3 ∧ …
– pi is a Boolean predicate
• Composite subscriptions
– CS = R(S1, S2, S3, … Sm)
• R: Operators
– Temporal operators:
• , : contiguous sequence (no event can be skipped)
• ; : non-contiguous sequence (events can be skipped)
• @ : explicit temporal operator
– Boolean operators:
• ∧ : conjunction
• ∨ : disjunction
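A primitive subscription can be sketched as a conjunction of predicates evaluated against one event. The triple encoding `(attribute, operator, value)` and the function name `matches` are assumptions made for illustration; they are not the BDD-based matcher used in P-ToPSS.

```python
import operator

# Comparison operators supported by this sketch (an assumed, minimal set).
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}


def matches(subscription, event):
    """Return True if every predicate of the subscription holds for the event.

    subscription: list of (attribute, op, value) triples, read as a conjunction.
    event: dict mapping attribute names to values.
    """
    return all(
        attr in event and OPS[op](event[attr], value)
        for attr, op, value in subscription
    )


failed_login = [("login", "=", "denied")]
event = {"ip": "211.43.206.53", "login": "denied"}
```

With these definitions, `matches(failed_login, event)` evaluates to `True`, while an event with `login` set to `success` would not match.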
Example
csintrusion = s1; ( ( s2;s3@(t3-t2<d) )  (s4,s5) );s6
s1: ip=$x ∧ login=denied
s2: ip=$x ∧ login=denied
s3: ip=$x ∧ login=success
s4: ip=$x ∧ login=success
s5: ip=$x ∧ action=passwd
s6: ip=$x ∧ action=logoff
csintrusion matched by {e0, e1}, e2, e3, e4
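The variable $x ties the primitive subscriptions to the same IP address. The sketch below assumes one plausible reading of that semantics: $x binds on first use and must equal the bound value afterwards. The function name `match_with_bindings` and the dict encoding are hypothetical.

```python
def match_with_bindings(subscription, event, bindings):
    """Match one primitive subscription, honoring variables such as '$x'.

    subscription: dict attribute -> constant or '$variable'.
    event: dict attribute -> value.
    bindings: dict of variables bound by earlier primitive subscriptions.
    Returns the (possibly extended) bindings on a match, otherwise None.
    """
    new_bindings = dict(bindings)
    for attr, expected in subscription.items():
        if attr not in event:
            return None
        actual = event[attr]
        if isinstance(expected, str) and expected.startswith("$"):
            # Bind the variable on first use; afterwards the values must agree.
            if new_bindings.setdefault(expected, actual) != actual:
                return None
        elif actual != expected:
            return None
    return new_bindings


s1 = {"ip": "$x", "login": "denied"}
s3 = {"ip": "$x", "login": "success"}
after_s1 = match_with_bindings(s1, {"ip": "211.43.206.53", "login": "denied"}, {})
```

A later successful login from a different address does not match s3 under the bindings in `after_s1`, because $x is already bound to 211.43.206.53.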
Problem Statement
• Matching Problem
– Given a set of composite subscriptions, CS, and
an event stream, {ei}, find all cs = R(s1, s2, …, sn)
such that there exists {ej1, ej2, …, ejn} ⊆ {ei} where ej1
matches s1, … , ejn matches sn subject to R, and all
time constraints are satisfied.
• Prediction Problem
– Find all partially matched cs such that
Prcs(full match | partial) > θcs
Required Matching Tasks
• Composite subscription: s1; ( (s2;s3@(t3-t1<d) )  (s4,s5) );s6
• Primitive subscriptions, like si, matching single events (i.e.,
sets of attribute-value-pairs)
• Sequences of primitive subscriptions matching
consecutive and non-consecutive events in the input
• Boolean expressions, like term1  term2 above, matching
higher-level patterns of events
• Computation of probabilities to predict full matches given
partial matches
Matching Engine
[Figure: matching engine architecture for the example composite subscription. Components: Primitive Subscriptions Matcher, State Machine Engine (handles sequences such as s2;s3), Boolean Expression Tree Matcher (handles Boolean terms), Prediction Engine. Labeled flows: event stream, primitive subscription matches, derived events, partial matches, full matches, predictions of the form (subscription, matching probability > θS).]
Algorithms for Matching Tasks
• Primitive Subscription Matcher
– BDD-based approach (our ICDCS’05 algorithm)
– Alternatively, our SIGMOD’01 algorithm or our new
indEX (fastest Boolean Expression Index on the
market)
• Boolean Expression Tree Matcher (state-based)
– Extension of the Rete algorithm as an in-memory
event processing network (Forgy, 1982)
– For extensions & implementation, see our PADRES
code base at padres.msrg.org
• State Machine Engine
– Based on evaluating finite state machines (FSMs)
– Combined with techniques to merge states to amortize
processing of similar subscriptions
– Combined with algorithms and data structures to track
time conditions
• Prediction Engine
– Based on training and evaluating a Markov model
• Trained on past events
• Evaluation over event stream
State Machine Engine
• State machine creation
• State machine evaluation
Example: F, F, F @(tN3-tN1<d), S
Contiguous sequence operator
We abstract for ease of presentation
• F represents a primitive subscription that evaluates to true for a failed
login
• S represents a primitive subscription that evaluates to true for a successful
login
• Index in time constraint refers to position (state) in the subscription (FSM)
[Figure: FSM with states N0, N1 (F), N2 (F,F), N3 (F,F,F), N4 (F,F,F,S); transitions labeled F, F, F @(tS3-tS1<d), S. Each state records t, the time of the most recent transition into the state.]
• Explicit temporal operator treated as another predicate to be
evaluated over transition times tracked for all states
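The contiguous example can be sketched as follows, assuming each partial match is kept as its own instance holding the transition times recorded so far. This is an illustrative simplification with hypothetical names, not the P-ToPSS state machine engine, and it hard-codes the pattern F, F, F @(t3 - t1 < d), S.

```python
def match_contiguous(stream, d):
    """Find full matches of F, F, F @(t3 - t1 < d), S over a stream.

    stream: list of (timestamp, kind) pairs with kind 'F' or 'S'.
    Returns a list of tuples holding the four transition times of each match.
    """
    pattern = ["F", "F", "F", "S"]
    active = []    # one list of transition times per partial match
    matches = []
    for t, kind in stream:
        survivors = []
        # The extra empty list is a fresh instance sitting in N0.
        for times in active + [[]]:
            state = len(times)
            if kind != pattern[state]:
                continue  # contiguous operator: no event may be skipped
            times = times + [t]
            # The time constraint is one more predicate on entering N3.
            if len(times) == 3 and not (times[2] - times[0] < d):
                continue
            if len(times) == len(pattern):
                matches.append(tuple(times))
            else:
                survivors.append(times)
        active = survivors
    return matches


stream = [(1, "F"), (2, "F"), (3, "F"), (4, "F"), (5, "S")]
result = match_contiguous(stream, d=5)
```

Because the operator is contiguous, the instance that started at time 1 is discarded when a fourth F arrives, and only the instance that started at time 2 completes.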
[Figure: evaluation of S = F, F, F @(tN3-tN1<d), S (contiguous sequence operator) over the event stream F, F, F arriving at times t1, t2, t3, showing the current states and their recorded transition times at t1, at t2, and at t3]
Example: F; S1; F; S2@(tS2-tS1<T)
Non-contiguous sequence operator
[Figure: FSM with states N0, N1 (F), N2 (F;S), N3 (F;S;F), N4 (F;S;F;S@T); transitions labeled F, S, F, S@T; self links labeled *]
• Primary link: first transition into a state
• Secondary link: continued matching of the primitive subscription that led to the transition into this state
• Self link: triggered for every event except those that trigger primary & secondary links
• Events not contributing to matching a subscription are allowed to
occur (must remain in current state; achieved via self-links)
• Upon a match of the next primitive subscription
• Time conditions are checked, if any
• Transition times are updated
• Transition times are only tracked for primary & secondary links
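A minimal single-instance sketch of the non-contiguous example F; S1; F; S2 @(tS2 - tS1 < T). It assumes that a secondary link overwrites the recorded time with the most recent one and that a failed time condition leaves the machine in its current state; both are simplifying assumptions, and the function name is hypothetical.

```python
def match_noncontiguous(stream, T):
    """Evaluate F; S1; F; S2 @(tS2 - tS1 < T) with one FSM instance.

    stream: iterable of (timestamp, kind) pairs.
    Returns the most recent transition times into N1..N4 on a full match,
    otherwise None.
    """
    pattern = ["F", "S", "F", "S"]
    times = []  # times[k] is the latest transition time into state N(k+1)
    for t, kind in stream:
        state = len(times)
        if kind == pattern[state]:
            # Primary link, guarded by the time condition on the last step.
            if state == 3 and not (t - times[1] < T):
                continue  # condition fails: remain in the current state
            times.append(t)
            if len(times) == len(pattern):
                return tuple(times)
        elif state > 0 and kind == pattern[state - 1]:
            # Secondary link: same primitive subscription matched again.
            times[-1] = t
        # Any other event follows the self link: nothing changes.
    return None


stream = [(1, "F"), (2, "S"), (3, "S"), (4, "F"), (9, "S")]
result = match_noncontiguous(stream, T=7)
```

In this run the second S at time 3 follows the secondary link and refreshes tS1, so the final S at time 9 satisfies 9 - 3 < 7; with the first S time alone (2) the condition would fail.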
[Figure: tracking transition times for S1; S2 @T1; S3 @T2 @T3 with T1: (tS2-tS1 < 3), T2: (tS3-tS1 < 6), T3: (tS3-tS2 > 3), over the event stream S1, S1, S1, S2, S2, S2, S3, S3 at times t1 to t8. States N0, N1 (S1), N2 (S1;S2), N3 (S1;S2;S3), with self links taken when the next primitive subscription does not match or its time conditions fail. Recorded times: Time(S1) = {t1, t2, t3}; Time(S2): t4 with Tc(S1) = {t2, t3}, t5 with Tc(S1) = {t3}; Time(S3): t8 with Tc(S2) = {t4}, Tc(S1) = {t3}.]
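The candidate sets Tc in this example can be reproduced with a brute-force check, assuming unit time steps (t1 = 1 through t8 = 8), which the slide does not state explicitly. The helper names are hypothetical, and the real engine maintains these sets incrementally rather than by enumeration.

```python
# Assumption: timestamps are unit steps, t1 = 1 ... t8 = 8.
s1_times = [1, 2, 3]   # S1 matched at t1, t2, t3
s2_times = [4, 5, 6]   # S2 matched at t4, t5, t6
s3_times = [7, 8]      # S3 matched at t7, t8


def t1_holds(t_s1, t_s2):
    return t_s2 - t_s1 < 3   # T1


def t2_holds(t_s1, t_s3):
    return t_s3 - t_s1 < 6   # T2


def t3_holds(t_s2, t_s3):
    return t_s3 - t_s2 > 3   # T3


# Tc(S1) kept with each S2 match; S2 matches without candidates are dropped.
tc_s1 = {}
for t_s2 in s2_times:
    cands = {t for t in s1_times if t1_holds(t, t_s2)}
    if cands:
        tc_s1[t_s2] = cands

# Full matches: an S3 time with compatible S2 and S1 candidate times.
full_matches = []
for t_s3 in s3_times:
    for t_s2 in sorted(tc_s1):
        if not t3_holds(t_s2, t_s3):
            continue
        for t_s1 in sorted(tc_s1[t_s2]):
            if t2_holds(t_s1, t_s3):
                full_matches.append((t_s1, t_s2, t_s3))
```

Under the unit-step assumption this yields Tc(S1) = {t2, t3} for S2 at t4 and {t3} for S2 at t5, and the only full match is (t3, t4, t8), which agrees with the sets recorded in the figure.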
Merging State Machines
Two states N1 and N2 are equivalent iff:
1. The number of incoming transitions of N1 and N2 are equal.
2. Any incoming transitions arrive from equivalent states and are triggered
by the same set of events. Initial states are equivalent.
[Figure: three state machines for the subscriptions a; b; c, a; b; d, and a, and the merged machine with states M0 to M4 (M3 labeled (a;b,c), M4 labeled (a;b,d)), in which equivalent states are shared]
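For purely linear subscriptions, the equivalence rule reduces to sharing common prefixes, which can be sketched with a trie. This is a simplification under stated assumptions: it ignores the difference between contiguous and non-contiguous operators, self links, and time conditions, and `merge_sequences` is a hypothetical name.

```python
def merge_sequences(subscriptions):
    """Merge linear state machines by sharing states of common prefixes.

    subscriptions: list of event-label sequences, e.g. ["a", "b", "c"].
    Returns (transitions, accepting) where transitions maps
    (state, label) -> state and accepting maps a state to the ids of the
    subscriptions that are fully matched in that state.
    """
    transitions = {}
    accepting = {}
    next_state = 1  # state 0 is the shared initial state
    for sub_id, sequence in enumerate(subscriptions):
        state = 0
        for label in sequence:
            key = (state, label)
            if key not in transitions:
                # No equivalent state exists yet: create one.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.setdefault(state, []).append(sub_id)
    return transitions, accepting


subs = [["a", "b", "c"], ["a", "b", "d"], ["a"]]
transitions, accepting = merge_sequences(subs)
merged_states = {0} | set(transitions.values())
unmerged_states = sum(len(s) + 1 for s in subs)
```

For the three subscriptions in the figure, the separate machines need 10 states in total while the merged machine needs 5, matching the states M0 to M4.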
Markov Model for Prediction
• FSMs record incremental matches of subscriptions
• Probability of transitioning to next state for a given
event depends only on current state
• Our FSMs are Markov processes
• Our prediction algorithm uses the properties of
Markov processes to predict future matches based
on current state and event history
– Probability of reaching the final state in n events
– … of reaching final state in the next 1, 2, 3, … n events
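The probability of reaching the final state within n events can be sketched by propagating a state distribution through the transition matrix while treating the final state as absorbing. The matrix values below are made up for illustration, and `reach_probability` is a hypothetical name.

```python
def reach_probability(P, start, final, n):
    """Probability of having reached `final` within n events from `start`.

    P[i][j] is the transition probability from state i to state j.
    The final state is treated as absorbing, so mass that reaches it stays.
    """
    size = len(P)
    dist = [0.0] * size
    dist[start] = 1.0
    for _ in range(n):
        nxt = [0.0] * size
        for i, mass in enumerate(dist):
            if mass == 0.0:
                continue
            if i == final:
                nxt[final] += mass  # already matched: remain in final state
                continue
            for j, p in enumerate(P[i]):
                nxt[j] += mass * p
        dist = nxt
    return dist[final]


# Illustrative 3-state chain; state 2 is the final (fully matched) state.
P = [
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
    [0.0, 0.0, 1.0],
]
within_one = reach_probability(P, start=1, final=2, n=1)
within_two = reach_probability(P, start=1, final=2, n=2)
```

A prediction would then be reported when this probability exceeds the threshold chosen for the subscription, for example when `within_two > 0.6`.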
Prediction & Training
• Compute long-run transition probability of reaching a
given state
• Based on the input (event history), we count the
number of times transitions are taken
• Based on counters, we compute transition
probabilities of the model
• Complete Markov chain with finite state space
• Transition probability from state i to j is
p_ij = Pr(X_{n+1} = j | X_n = i)
– Conditional probability of transitioning to j given i
– Estimated as (# times transition taken) / (all incoming transitions)
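Training from counters can be sketched as follows, assuming the event history has already been turned into sequences of visited FSM states. Each count is normalized here by the total number of transitions leaving the source state, which is the usual estimator and an assumption about the exact normalization used in P-ToPSS.

```python
from collections import Counter, defaultdict


def train_transition_probabilities(state_sequences):
    """Estimate p_ij from observed state sequences.

    state_sequences: list of lists of state ids visited in order.
    Returns a nested dict probs[i][j] = count(i -> j) / count(i -> any).
    """
    counts = defaultdict(Counter)
    for sequence in state_sequences:
        for i, j in zip(sequence, sequence[1:]):
            counts[i][j] += 1  # one more observation of the transition i -> j
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }


history = [[0, 1, 2], [0, 1, 0], [0, 0, 1]]
probs = train_transition_probabilities(history)
```

In this toy history, state 0 is left four times, three of them toward state 1, so `probs[0][1]` is 0.75.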
Experiments
• Synthetic workload
• Real data set
Effect of Number of Subscriptions
[Figures: number of states and average matching time per event as the number of subscriptions grows; series labeled Gaussian and Uniform, annotated "more sharing" and "less sharing"]
• Merging reduces number of states by
up to 30% for given data set
• Number of states increases linearly in
number of subscriptions
• More states are required for workloads
with less state sharing potential
• Matching time increases in the number
of subscriptions
• More sharing requires more processing
as a given event may trigger more
transitions
Effect of Number of Non-contiguous Operators
• Matching time increases in number of non-contiguous operators
• More and more subscription instances are partially matched, waiting for events
• Calls for a garbage collection scheme
[Figure: average matching time per event vs. number of non-contiguous operators]
Experiments on Synthetic Workload
• Precision decreases as look-ahead increases
• Precision increases as prediction-threshold increases
and stabilizes for large thresholds
Expert Model (full) vs. Learned Model
Precision defined as True positives / All predictions
Result: With increasing look-ahead, the learned model achieves
higher precision than the full model.
Full model (about 1400 states)
Learned model (5 states)
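The precision metric defined above can be computed as in the following sketch, assuming predictions and actual matches are collected as sets of identifiers. The function name and the choice to return `None` when nothing was predicted are assumptions.

```python
def precision(predicted, actual):
    """Precision = true positives / all predictions.

    predicted: set of identifiers predicted to match within the look-ahead.
    actual: set of identifiers that did match within the look-ahead.
    """
    if not predicted:
        return None  # undefined when no prediction was made
    true_positives = len(predicted & actual)
    return true_positives / len(predicted)


score = precision({1, 2, 3, 4}, {2, 4, 5})
```

Here two of the four predictions are confirmed, so `score` is 0.5.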
Conclusions
• P-ToPSS is a new publish/subscribe model for
event stream processing
• Predicts the probability a subscription will match in
the future
• Performs traditional publish/subscribe matching
• Supports state-based, temporal and Boolean
operators over predicates (complex subscriptions)
• Based on Markov chains for prediction
• Prediction performance of learned model is
better than hand-crafted model in our
experiments