Anomaly Detection Through a Bayesian SVM

Vasilis A. Sotiris
AMSC 664 Final Presentation
May 6th 2008
Advisor: Dr. Michael Pecht
University of Maryland
College Park, MD 20783
Objectives
• Develop an algorithm to detect anomalies in electronic systems (large multivariate datasets)
• Perform detection in the absence of negative class data – one-class classification
• Predict future system performance
• Develop an application toolbox – CALCEsvm – to implement a proof of concept on simulated and real data:
  – Simulated degradation
  – Lockheed Martin dataset
Motivation
• With increasing functional complexity of on-board autonomous systems, there is an increasing demand for system-level:
  – Health assessment
  – Fault diagnostics
  – Failure prognostics
• This is of special importance for analyzing intermittent failures, some of the most common failure modes in today's electronics
• There is a need for efficient and reliable prognostics for electronic systems using algorithms that can:
  – fuse sensor data
  – discriminate false alarms from actual failures
  – correlate faults with relevant system events
  – reduce redundant processing elements, which are subject to common-mode failures
Algorithm Objectives
• Develop a machine learning approach to:
  – detect anomalies in large multivariate systems
  – detect anomalies in the absence of reliable failure data
• Mitigate false alarms and intermittent faults and failures
• Predict future system performance

[Figure: scatter of data in the $(x_1, x_2)$ plane showing the distribution of training data, the distribution of fault/failure data, and an unknown fault space]
Data Setup
• Data is collected at times $T_i$ from a multivariate distribution of random variables $x_{1i},\dots,x_{mi}$
  – the x's are the system covariates
• The $X_i$'s are independent random vectors
• Class $y \in \{-1, +1\}$
• Class probability $= p(\text{class} \mid X)$: estimate the class probability given the observed vectors $X_1, X_2, \dots, X_n$
Ti    x1     x2     x3    …   xm     Class   Class Probability
T1    x11    x21    x31   …   xm1    1       0.95
T2    x12    x22    x32   …   xm2    1       0.96
…
Tn    x1n    x2n    x3n   …   xmn    -1      0.45
Data Decomposition (Models)
• Extract features from the data by constructing lower dimensional models
• X – training data $\in \mathbb{R}^{n \times m}$
• Singular Value Decomposition (SVD):
  $X = U \Sigma V^T$
  $H = U_k U_k^T$
  $G = I - U_k U_k^T$
• With H, project the data onto the model space [M] and the residual space [R] (a numpy sketch follows below):
  – k: number of principal components (k = 2)
  – $x_M$: the projection of x onto the model space [M]
  – $x_R$: the projection of x onto the residual space [R]
  $x = x_M + x_R$, i.e. $x = xH + x(I - H)$

[Figure: a sample observation x decomposed into its projection $x_M$ onto [M] and its residual $x_R$ in [R]]
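To make the projection step concrete, here is a minimal numpy sketch. It assumes observations are stored as rows of X, so the projector is built from the top-k right singular vectors (the slide's $H = U_k U_k^T$ is the column-observation form of the same idea); all names are illustrative and not part of CALCEsvm.

```python
import numpy as np

def svd_projections(X, k=2):
    """Split each row of X into a model-space part x_M and a
    residual part x_R using the rank-k SVD subspace (x = x_M + x_R)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:k].T                  # top-k right singular vectors (m x k)
    H = Vk @ Vk.T                  # projector onto the model space [M]
    G = np.eye(X.shape[1]) - H     # projector onto the residual space [R]
    return X @ H, X @ G            # x_M, x_R for every observation
```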
Two Class Support Vector Machines
[Figure: data in the input space $(x_1, x_2)$ mapped to a feature space $(F_1, F_2, F_3)$, where a linear decision function D(x) separates the classes -1 and +1 and maps back to a nonlinear boundary in the input space]

$D(x) = \sum_{i=1}^{n} w_i x_i + b = w^T x + b$
• Given: nonlinearly separable labeled data $x_i$ with labels $y_i \in \{+1, -1\}$
• Solve a linear optimization problem to find w and b in the feature space
• Form a nonlinear decision function by mapping back to the input space
• The result is a decision boundary on the given training set that can be used to classify new observations
Two Class Support Vector Machines
• Interested in the function that best separates two classes of data
• The margin $M = 2/\|w\|$ can be maximized by minimizing $\|w\|$
  – the learning problem is stated as (a fitted example follows below):
    $\min \tfrac{1}{2}\|w\|^2 = \tfrac{1}{2} w^T w$
  – subject to:
    $y_i (w^T x_i + b) - 1 \ge 0$ for $i = 1,\dots,n$
• The classifier function $D(x) = w^T x + b$ is constructed with the appropriate w and b (b: distance from the origin to D(x) = 0)

[Figure: positive and negative classes separated by the hyperplane D(x) = 0, with margin M and normal vector w]
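As a quick illustration of this primal problem, the sketch below fits a linear two-class SVM on synthetic data using scikit-learn (not the author's CALCEsvm code) and reads off w, b and the margin $M = 2/\|w\|$. The synthetic clusters and the large C value (to approximate a hard margin) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, (50, 2)),    # positive class
               rng.normal(-2.0, 0.5, (50, 2))])   # negative class
y = np.array([+1] * 50 + [-1] * 50)

clf = SVC(kernel="linear", C=1e3).fit(X, y)       # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]            # D(x) = w^T x + b
print("margin M =", 2.0 / np.linalg.norm(w))      # M = 2 / ||w||
```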
Two Class Support Vector Machines
• Lagrangian function:
  $L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$
• The KKT conditions give:
  $w = \sum_{i=1}^{n} \alpha_i y_i x_i$
  $D(x) = \sum_{i=1}^{n} y_i \alpha_i x_i^T x + b$
• Instead of minimizing $L_P$ w.r.t. w and b, minimize $L_D$ w.r.t. $\alpha$:
  $\min L_D(\alpha) = \tfrac{1}{2} \alpha^T H \alpha - p^T \alpha$
  where H is the Hessian matrix with $H_{ij} = y_i y_j x_i^T x_j$, $\alpha = [\alpha_1,\dots,\alpha_n]$, and p is a unit vector
Two Class Support Vector Machines
• In the nonlinear case, use a kernel mapping $\Phi$ centered at each x:
  $D(x) = \sum_{i=1}^{n} w_i x_i + b = \sum_{i=1}^{n} y_i \alpha_i \Phi(x_i)^T \Phi(x) + b$
• Form the same optimization problem (the kernelized Hessian is sketched below):
  $\min L_D(\alpha) = \tfrac{1}{2} \alpha^T H \alpha - f^T \alpha$
  where $H_{ij} = y_i y_j \Phi(x_i)^T \Phi(x_j) = y_i y_j k(x_i, x_j)$

[Figure: distribution of training data enclosed by the nonlinear boundaries D(x) = -1, 0, +1, with support vectors on the margin and the distribution of fault/failure data outside]

• Argument: the resulting function D(x) is the best classifier for the given training set
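To show how the kernelized Hessian is assembled in practice, here is a small numpy sketch with a Gaussian (RBF) kernel; the kernel choice and the gamma parameter are assumptions for illustration, since the slides do not fix a specific kernel.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K_ij = k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq)

def dual_hessian(X, y, gamma=1.0):
    """H_ij = y_i y_j k(x_i, x_j), the Hessian of the dual problem."""
    return (y[:, None] * y[None, :]) * rbf_kernel(X, gamma)
```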
Bayesian Interpretation of D(x)
• The classification $y \in \{-1, +1\}$ for any x is equivalent to asking whether $p(Y{=}+1 \mid X{=}x) \gtrless p(Y{=}-1 \mid X{=}x)$:
  $p(y{=}+1 \mid x) = \dfrac{p(y{=}+1, X{=}x)}{P(X{=}x)}$
  $p(y{=}-1 \mid x) = 1 - p(y{=}+1 \mid x)$
• An optimal classifier $y_{MAP}$ maximizes the conditional probability:
  $y_{MAP} = \arg\max_{a \in \{-1,+1\}} p(Y{=}a \mid X{=}x)$, i.e. +1 if $p(y{=}+1 \mid X{=}x) > 0.5$ and -1 if $p(y{=}+1 \mid X{=}x) < 0.5$
• D(x) is obtained from a quadratic optimization problem
• It can be shown that D(x) is the maximum a posteriori (MAP) solution to $P(Y{=}y \mid X{=}x) = P(\text{class} \mid \text{data})$, and therefore the optimal classifier of the given two classes:
  $D(x) \equiv y_{MAP}$
One Class Training
• In the absence of negative class data (fault or failure information), a one-class classification approach is used
• $X = (X_1, X_2)$ ~ bivariate distribution
• Likelihood of the positive class: $L = p(X{=}x_i \mid y{=}+1)$
• Class label $y \in \{-1, +1\}$
• Use the margin of this likelihood to construct the negative class

[Figure: training data $x_i$ in the $(x_1, x_2)$ plane with the likelihood contour L of the positive class]
Nonparametric Likelihood Estimation
• If the probability that any data point $x_i$ falls into the kth bin is r, then the probability of a set of data $\{x_1,\dots,x_m\}$ falling into the kth bin is given by a binomial distribution:
  $P(x_1,\dots,x_m \mid r) = \binom{n}{m} r^m (1 - r)^{n-m}$
  – Total sample size: n
  – Number of samples in the kth bin: m
  – Region defined by the bin: R
• MLE of r: $\hat{r} = m/n$
• Density estimate (a 1-D sketch follows below): $\hat{f}(x) = \dfrac{m}{nV}$
  $f_R(x) = \dfrac{\#\text{ samples in } R}{\text{total }\#\text{ samples}} \cdot \dfrac{1}{\text{volume}(R)}$
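A one-dimensional sketch of this bin-counting estimate is given below: count the m samples that land in the region R of width h around x, and divide by n times the bin volume. The function name and interface are illustrative.

```python
import numpy as np

def hist_density(x, data, h):
    """f_hat(x) = (m / n) / V: the fraction of samples in the bin R
    of width h centered at x, divided by the bin volume V = h (1-D)."""
    data = np.asarray(data)
    m = np.sum(np.abs(data - x) <= h / 2.0)   # samples inside R
    return (m / data.size) / h
```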
Estimate Likelihood: Gaussian Kernel φ
• The volume of R: $V = h^d$
• For a uniform kernel, the number of data points m in R: $m = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h}\right)$
  – Kernel function: $\varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$
• Points $x_i$ which are close to the sample point x receive higher weight
  – The resulting density $f_\varphi(x)$ is smooth (a sketch follows below):
    $f_\varphi(x) = \frac{m}{nV} = \frac{1}{n h^d} \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h}\right)$
• The bandwidth h is selected according to a nearest-neighbor algorithm
  – Each bin R contains $k_n$ data points, with $k_n = \tfrac{1}{2} n^{2/3}$
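The smooth version of the same estimate, with the Gaussian kernel above, can be sketched as follows for d-dimensional data; the bandwidth h is assumed to be given (e.g. from the nearest-neighbor rule), and the names are illustrative.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """f_phi(x) = 1/(n h^d) * sum_i phi((x - x_i)/h) with a
    d-dimensional Gaussian kernel phi."""
    data = np.atleast_2d(data)                     # (n, d)
    n, d = data.shape
    u = (np.asarray(x) - data) / h                 # standardized offsets
    phi = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return phi.sum() / (n * h**d)
```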
Estimate of Negative Class
• The negative class is estimated based on the likelihood of the positive class (training data)
• A threshold t is used on the likelihood ratio of positive to negative class probability for the given training data (a sampling sketch follows below)
• A 1-D cross-section of the density illustrates the idea of the threshold ratio:

[Figure: 1-D cross-section of the likelihood $P(X_{11}{=}x_1,\dots,X_{1n}{=}x_n \mid y_1,\dots,y_n{=}+1)$; the threshold t cuts the density into a positive region and a negative-class region [N], into which test points such as $x_{1,n+1}$ and $x_{1,n+2}$ fall]

Ti      x1       x2
T1      x11      x21
T2      x12      x22
…
Tn      x1n      x2n
Tn+1    x1,n+1   x2,n+1
Tn+2    x1,n+2   x2,n+2
…
Tn+k    x1,n+k   x2,n+k
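A hedged sketch of the construction: score the training data with the density estimate (gaussian_kde above), pick a likelihood threshold t, and collect low-likelihood points around the training data as negative-class candidates. The quantile used for t and the uniform candidate sampling are illustrative assumptions, not the author's exact procedure.

```python
import numpy as np

def estimate_negative_class(train, h, t_quantile=0.05, n_neg=50, seed=0):
    """Return sampled negative-class points and the threshold t
    (illustrative scheme, not the CALCEsvm implementation)."""
    rng = np.random.default_rng(seed)
    lik = np.array([gaussian_kde(x, train, h) for x in train])
    t = np.quantile(lik, t_quantile)              # likelihood threshold
    lo, hi = train.min(0) - 3 * h, train.max(0) + 3 * h
    cand = rng.uniform(lo, hi, size=(20 * n_neg, train.shape[1]))
    keep = np.array([gaussian_kde(x, train, h) for x in cand]) < t
    return cand[keep][:n_neg], t                  # low-likelihood points
```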
D(x) as a Sufficient Statistic
• D(x) can be used as a sufficient statistic to classify a data point x:
  $P(y{=}+1 \mid X{=}x) = P(y{=}+1 \mid D(x))$
• Argument: since D(x) is the optimal classifier, posterior class probabilities are related to the data's distance to D(x) = 0
• These probabilities can be modeled by a logistic distribution centered at D(x) = 0

[Figure: logistic curve of posterior class probability as a function of D(x)]
Posterior Class Probability
• The positive posterior class probability is given by:
  $P(y{=}+1 \mid x_i) = \dfrac{P(Y{=}+1, X{=}x_i)}{P(X{=}x_i)} = \dfrac{P(X{=}x_i \mid Y{=}+1)\,P(Y{=}+1)}{\sum_{a \in \{-1,+1\}} P(X{=}x_i \mid Y{=}a)\,P(Y{=}a)} = \dfrac{1}{1 + e^{-a_i}}$  (logistic distribution)
  where $a_i = \log \dfrac{P(X{=}x_i \mid Y{=}+1)\,P(Y{=}+1)}{P(X{=}x_i \mid Y{=}-1)\,P(Y{=}-1)}$
• Use D(x) as the sufficient statistic for the classification of $x_i$, by replacing $a_i$ with $D(x_i)$:
  $P(y{=}+1 \mid X{=}x_i) = P(y{=}+1 \mid D(x_i)) = p_i$
• Simplify:
  $p_i = P(y{=}+1 \mid X{=}x_i) = \dfrac{1}{1 + e^{-(A\,D(x_i) + B)}}$
• Get maximum likelihood estimates of the parameters A and B $\rightarrow \hat{p}_i^{MLE}$ (a fitting sketch follows below)
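This is essentially Platt-style probability scaling. A minimal gradient-ascent sketch for fitting A and B by maximizing the Bernoulli log-likelihood (defined in the backup slides) is shown below; the learning rate and iteration count are illustrative choices, not the author's settings.

```python
import numpy as np

def fit_sigmoid(D, c, iters=500, lr=0.1):
    """Fit p_i = 1/(1 + exp(-(A*D_i + B))) to targets c_i in {0, 1}
    by gradient ascent on sum_i [c_i log p_i + (1-c_i) log(1-p_i)]."""
    A, B = 1.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A * D + B)))
        g = c - p                     # gradient of log-lik w.r.t. (A*D+B)
        A += lr * np.mean(g * D)
        B += lr * np.mean(g)
    return A, B
```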
Joint Probability Model
• Interested in $P = P(Y \mid X_M, X_R)$, the joint probability of classification given the two models:
  – $X_M$: model space [M]
  – $X_R$: residual space [R]
• Assume $X_M$ and $X_R$ are independent
• After some algebra, get the joint positive and negative posterior class probabilities P(+) and P(-), as sketched below
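The slide does not reproduce the resulting expression. Under the stated independence assumption, and additionally assuming equal class priors, a Bayes-rule combination of the per-model posteriors takes a form like the following reconstruction (a sketch consistent with the slide's assumptions, not necessarily the author's exact formula):

$$P(+) = \frac{P_M(+)\,P_R(+)}{P_M(+)\,P_R(+) + P_M(-)\,P_R(-)}, \qquad P(-) = 1 - P(+)$$

where $P_M(\pm)$ and $P_R(\pm)$ are the posterior class probabilities obtained from the model space and the residual space respectively.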
Case Studies
I.  Simulated degradation
II. Lockheed Martin dataset
Case Study I – Simulated Degradation
• Given:
  – Simulated correlated data
  – X1 ~ Gamma(2,1), X2 ~ t(5), X3 ~ Beta(2,2)
• Degradation modeling (a simulation sketch follows below):
  – A period of healthy data
  – Three successive periods of increasingly larger changes in the mean of each parameter
  – The first with a probability close to 1
  – For the three successive periods, a decreasing trend
• Expecting the posterior classification probability to reflect these four periods accordingly

[Figure: simulated dependent Gamma, t and Beta values (changes in sample means) over observations 0–400]
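A hedged numpy sketch of this setup is shown below: one healthy period followed by three periods with increasingly larger mean shifts in each parameter. The period length and shift sizes are illustrative assumptions, not the author's exact values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per, shifts = 100, [0.0, 0.5, 1.0, 2.0]        # healthy + 3 shifted periods
blocks = []
for s in shifts:
    x1 = rng.gamma(2.0, 1.0, n_per) + s          # X1 ~ Gamma(2,1) + shift
    x2 = rng.standard_t(5, n_per) + s            # X2 ~ t(5) + shift
    x3 = rng.beta(2.0, 2.0, n_per) + s           # X3 ~ Beta(2,2) + shift
    blocks.append(np.column_stack([x1, x2, x3]))
X = np.vstack(blocks)                            # (400, 3) observations
```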
Case Study I Results – Simulated Degradation
• Results: a plot of the joint positive classification probability

[Figure: joint positive posterior classification probability vs. observation number (0–400), with the four periods P1–P4 marked on the curve]
Case Study II – Lockheed Martin Data
(Known Faulty Periods)
• Given: dataset from Lockheed Martin
  – Type of data: server data, unknown parameters
  – Multivariate: 22 parameters, 2741 observations
  – Healthy period (T): observations 0–800
  – Fault periods: F1: observations 912–1040, F2: 1092–1106, F3: 1593–1651
• Training data constructed with a sample from period T, of size n = 140
• Goal:
  – Detect the onset of the known faulty periods without knowledge of "unhealthy" system characteristics
Case Study II - Results
• Results: a plot of the joint posterior class probability estimate

[Figure: joint posterior class probability estimate vs. observation, with the healthy period T (through observation 800) and the fault periods F1 (from observation 912) and F2 marked]
Comparison Metrics of Code Accuracy
(LibSVM vs CALCEsvm)
• An established and commercially used C++ SVM package (LibSVM) was used to test the accuracy of the code
• LibSVM features used: two-class SVM
  – LibSVM does not provide classification probabilities for one-class SVM
• Input to LibSVM:
  – Positive class: the same training data
  – Negative class: the negative class data estimated by CALCEsvm
• Metric: detection accuracy
  – The count of correct classifications, based on two categories:
    • the classification label y
    • a correct classification probability estimate
Detection Accuracy LibSVM vs CALCEsvm
(Case Study 1 – Degradation Simulation)
• Description of test:
  – Period 1 should be captured with a positive-class probability estimate ranging from 80% to 100%
  – Period 2 equivalently between 70% and 85%
  – Period 3 between 30% and 70%
  – Period 4 between 0% and 40%
• Based on just the class index, the detection accuracy of the two algorithms was almost identical
• Based on ranges of probabilities, LibSVM performs better in the early stages where the system is healthy, but performs worse than CALCEsvm in detecting degradation

LibSVM output
       Based on      Based on      Range for
       class index   probability   probability
P1     1             1             0.8 – 1
P2     0.038         0.144         0.7 – 0.85
P3     0.29          0.46          0.3 – 0.7
P4     0.88          0.79          0 – 0.4

CALCEsvm output
       Based on      Based on      Range for
       class index   probability   probability
P1     1             0.8125        0.8 – 1
P2     0.038         0.317         0.7 – 0.85
P3     0.29          0.31          0.3 – 0.7
P4     0.88          0.84          0 – 0.4
Detection Accuracy LibSVM vs CALCEsvm (Case Study 2 –
Lockheed Data)
• Description of test:
  – The acceptable probability estimate for a correct positive classification should lie between 80% and 100%
  – Similarly, the acceptable probability estimate for a negative classification should not exceed 40%
• Based on the class index, LibSVM and CALCEsvm perform almost identically, with slightly better performance for CALCEsvm
• Based on acceptable probability estimates, LibSVM:
  – does a poor job of identifying the healthy state between successive faulty periods
  – performs much better at detecting the anomalies
• CALCEsvm:
  – performs much better overall, identifying the faulty and healthy periods in the data correctly both by class index and by acceptable probability ranges

                          Detection accuracy     Detection accuracy based on probability
                          based on class index   positive (0.8 – 1)   negative (0 – 0.4)
LibSVM (model test)       99.60%                 30.50%               100.00%
LibSVM (residual test)    99.60%                 30.50%               100.00%
CALCEsvm                  100.00%                98.10%               100.00%
Summary
• For the given data, and on some additional datasets, the CALCEsvm algorithm has accomplished the objective:
  – Detected the time events of known anomalies
  – Identified trends of degradation
• A first comparison of its performance accuracy against LibSVM is good!
Backups
Dual Form of Lagrangian Function
• Dual form of the Lagrangian function, for the optimization problem in $L_D$ space (a QP-solver sketch follows below):
  $L_P(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$
  $L_P(w, b, \alpha) = \tfrac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i y_i (w^T x_i + b) + \sum_{i=1}^{n} \alpha_i$
• Through the KKT conditions:
  $w = \sum_{i=1}^{n} \alpha_i y_i x_i$
  $L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$
• Subject to:
  $\sum_{i=1}^{n} \alpha_i y_i = 0$
Karush-Kuhn-Tucker (KKT) Conditions
• An optimal solution (w*, b*, α*) exists if and only if the KKT conditions are satisfied. In other words, the KKT conditions are necessary and sufficient to solve for w, b and α in a convex problem:
  (i)   $\partial L_P(w, b, \alpha) / \partial w = 0$
  (ii)  $\partial L_P(w, b, \alpha) / \partial b = 0$
  (iii) $\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0$ for $i = 1,\dots,n$
  (iv)  $\alpha_i \ge 0$ for $i = 1,\dots,n$
Posterior Class Probability
• Interested in finding the maximum likelihood estimates of the parameters A and B:
  $(\hat{A}_{MLE}, \hat{B}_{MLE}) = \arg\max_{A,B}\, p(X_1 \dots X_k, c_1 \dots c_k \mid A, B) = p(X_1 \dots X_k \mid A, B)\, p(c_1 \dots c_k \mid x_1 \dots x_k, A, B) \propto p(c_1 \dots c_k \mid x_1 \dots x_k, A, B)$
• The classification probability of a set of test data $X = \{x_1,\dots,x_k\}$ into $c \in \{1, 0\}$ is given by a product of Bernoulli distributions:
  $p(c_1 \dots c_k \mid x_1 \dots x_k) = \prod_{i=1}^{k} p_i^{c_i} (1 - p_i)^{1-c_i}$
• where $p_i$ is the probability of classification when $c_i = 1$ ($y = +1$) and $1 - p_i$ is the probability of classification when $c_i = 0$ (class $y = -1$)
  $\log p(c_1 \dots c_k \mid x_1 \dots x_k) = \sum_{i=1}^{k} \left[ c_i \log p_i + (1 - c_i) \log(1 - p_i) \right]$
Posterior Class Probability
• Maximize the likelihood of a correct classification $c_i$ for each $x_i$ (MLE)
• Determine the parameters $A_{MLE}$ and $B_{MLE}$ from the maximum likelihood equation above:
  $\log p(c_1 \dots c_k \mid x_1 \dots x_k) = \sum_{i=1}^{k} \left[ c_i \log p_i + (1 - c_i) \log(1 - p_i) \right]$
• Use $A_{MLE}$ and $B_{MLE}$ to compute $p_i^{MLE}$ in:
  $p_i^{MLE} = P(y{=}+1 \mid X{=}x_i) = \dfrac{1}{1 + e^{-(A_{MLE} F(x_i) + B_{MLE})}}$
• where $p_i^{MLE}$ is:
  – the maximum likelihood estimator of the posterior class probability $p_i$ (due to the invariance property of the MLE)
  – the best estimate of the classification probability of each $x_i$
• Currently implemented:
  $p_i = P(y{=}+1 \mid X{=}x_i) = \dfrac{1}{1 + e^{-F(x_i)}}$