Novelty Detection & One-Class SVM (OCSVM)


Outline
- Introduction
- Quantile Estimation
- OCSVM – Theory
- OCSVM – Application to Jet Engines

Novelty Detection is
- An unsupervised learning problem (the data are unlabeled).
- About identifying new or unknown data or signals that a machine learning system was not exposed to during training.

Example 1
[Figure: a cluster of "Normal" points with several "Novel" points lying away from it.]

So what seems to be the problem?
It's a 2-class problem: "Normal" vs. "Novel".

The Problem is
That "all positive examples are alike, but each negative example is negative in its own way."

Example 2
Suppose we want to build a classifier that recognizes web pages about "pickup sticks".
How can we collect training data?
We can surf the web and pretty easily assemble a sample to be our collection of positive examples.
What about negative examples?
The negative examples are… the rest of the web, that is, ~("pickup sticks web page").
So the negative examples come from an unknown number of negative classes.

Applications
Many exist:
- Intrusion detection
- Fraud detection
- Fault detection
- Robotics
- Medical diagnosis
- E-Commerce
- And more…

Possible Approaches
Density Estimation:
- Estimate a density based on the training data (sketched in the code below).
- Threshold the estimated density for test points.
Quantile Estimation:
- Estimate a quantile of the distribution underlying the training data: for a fixed constant $\nu \in (0,1]$, attempt to find a small set $S$ such that $\Pr(x \in S) \geq \nu$.
- Check whether test points are inside or outside $S$.

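As a concrete sketch of the density-estimation approach (my own illustration; the data, bandwidth, and 5% cutoff are arbitrary assumptions):

```python
# Sketch: density-estimation approach to novelty detection.
# Fit a KDE on "normal" data, then flag test points whose estimated
# log-density falls below a threshold.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                      # "normal" training sample

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

# Threshold the estimated (log-)density at the 5th percentile of training scores
threshold = np.quantile(kde.score_samples(X_train), 0.05)

X_test = np.array([[0.0, 0.0], [5.0, 5.0]])
is_novel = kde.score_samples(X_test) < threshold
print(is_novel)                                  # expect [False, True]
```
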
Quantile Estimation (QE)
A quantile function with respect to $(P, \lambda, H)$ is defined as:
$U(\alpha) = \inf\{\lambda(C) \mid P(C) \geq \alpha,\ C \in H\}$, $0 < \alpha \leq 1$,
where $H$ is a class of measurable subsets of $X$, and
$\lambda$ is a real-valued function, $\lambda : H \to \mathbb{R}$.
$C(\alpha)$ denotes the $C \in H$ that attains the infimum.

Quantile Estimation (QE)
The empirical quantile function is defined as above, where $P$ is the empirical distribution:
$P_{emp}(C) = \frac{1}{m} \sum_{i=1}^{m} I_C(x_i)$
$C_m(\alpha)$ denotes the $C \in H$ that attains the infimum on the training set.
Thus the goal is to estimate $C(\alpha)$ through $C_m(\alpha)$.

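To make this concrete, here is a toy 1-D sketch (my own illustration) with $H$ = intervals and $\lambda$ = interval length, which finds $C_m(\alpha)$ as the shortest interval covering a fraction $\alpha$ of the sample:

```python
# Sketch: empirical quantile estimation in 1-D with H = {intervals},
# lambda = interval length. Finds the shortest interval covering a
# fraction alpha of the sample (C_m(alpha) for this choice of H).
import numpy as np

def shortest_interval(x, alpha=0.9):
    x = np.sort(x)
    m = len(x)
    k = int(np.ceil(alpha * m))            # number of points the set must cover
    widths = x[k - 1:] - x[: m - k + 1]    # width of every window of k points
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.RandomState(0)
sample = rng.randn(1000)
print(shortest_interval(sample, 0.9))      # roughly (-1.64, 1.64) for N(0, 1)
```
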
Quantile Estimation
Choosing $H$ and $\lambda$ intelligently is important.
On one hand, a large class $H$ means there are many small sets that contain a fraction $\alpha$ of the training examples.
On the other hand, if we allowed just any set, the chosen set could consist of only the training points, giving poor generalization.

Complex vs. Simple
[Figure: a complex set and a simple set, each satisfying $P_{emp}(C) = 1$ on the training sample.]

Support Vector Method for Novelty Detection
Bernhard Schölkopf, Robert Williamson, Alex Smola, John Shawe-Taylor, John Platt

Problem Formulation
Suppose we are given a training sample drawn from an underlying distribution $P$.
We want to estimate a "simple" subset $S \subseteq X$ such that for a test point $x$ drawn from $P$, $\Pr(x \in S) \geq \nu$, $\nu \in (0,1]$.
We approach the problem by trying to estimate a function $f$ which is positive on $S$ and negative on the complement.

The SV Approach to QE
The class $H$ is defined as the set of half-spaces in a feature space $F$ (via a kernel $k$).
Here we define $\lambda(C_w) = \|w\|^2$, where $C_w = \{x \mid f_w(x) \geq \rho\}$.
$(w, \rho)$ are respectively a weight vector and an offset parameterizing a hyperplane in $F$.

"Hey, Just a second"
If we use hyperplanes and offsets, doesn't that mean we separate the "positive" sample? But separate from what?
From the Origin.

OCSVM
[Figure: the training data mapped by $\Phi$ into feature space, separated from the origin by a hyperplane with normal $w$ at distance $\rho / \|w\|$.]

OCSVM
$\nu$ serves as a penalizer, like $C$ in the 2-class SVM (recall that $\nu \in (0,1]$).
To separate the data set from the origin, we solve the following quadratic program:
$\min_{w \in F,\ \xi_i \in \mathbb{R},\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho$
subject to $\langle w, \Phi(x_i) \rangle \geq \rho - \xi_i$, $\xi_i \geq 0$.
Notice that no $y$'s are incorporated in the constraint, since there are no labels.

OCSVM
The decision function is therefore:
$f(x) = \mathrm{sgn}(\langle w, \Phi(x) \rangle - \rho)$
Since the slack variables $\xi_i$ are penalized in the objective function, we can expect that if $w$ and $\rho$ solve the problem, then $f$ will equal $+1$ for most examples in the training set, while $\|w\|$ still stays small.

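In practice this quadratic program is solved by standard libraries. Below is a minimal sketch using scikit-learn's OneClassSVM on synthetic data (the data and hyperparameters are illustrative assumptions, not from the slides):

```python
# Sketch: one-class SVM novelty detection with scikit-learn.
# nu is the penalizer discussed above; +1 = "normal" region, -1 = novel.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)                 # unlabeled "normal" sample
X_test = np.vstack([0.3 * rng.randn(20, 2),       # more normal points
                    rng.uniform(-4, 4, (5, 2))])  # a few likely-novel points

clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05)
clf.fit(X_train)                                  # note: no labels y are passed
print(clf.predict(X_test))                        # array of +1 / -1
```
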
OCSVM
Using multipliers $\alpha_i, \beta_i \geq 0$ we get the Lagrangian:
$L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho - \sum_i \alpha_i (\langle w, \Phi(x_i) \rangle - \rho + \xi_i) - \sum_i \beta_i \xi_i$

OCSVM
Setting the derivatives of $L$ w.r.t. $w$, $\xi$, $\rho$ to 0 yields:
1) $w = \sum_i \alpha_i \Phi(x_i)$
2) $\alpha_i = \frac{1}{\nu m} - \beta_i \leq \frac{1}{\nu m}$, $\quad \sum_i \alpha_i = 1$

OCSVM
Eq. 1 transforms $f(x)$ into a kernel expansion:
$f(x) = \mathrm{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right)$
Substituting 1 & 2 into $L$ yields the dual problem:
$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2}\sum_{ij} \alpha_i \alpha_j k(x_i, x_j)$
subject to $0 \leq \alpha_i \leq \frac{1}{\nu m}$, $\sum_i \alpha_i = 1$.
The offset $\rho$ can be recovered by exploiting that for any $0 < \alpha_i < \frac{1}{\nu m}$, the corresponding pattern $x_i$ satisfies:
$\rho = \langle w, \Phi(x_i) \rangle = \sum_j \alpha_j k(x_j, x_i)$

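As a small illustration of this recovery step, here is a sketch assuming a dual solution `alpha` and training kernel matrix `K` are already available (both hypothetical inputs, e.g. from a QP solver):

```python
# Sketch: recover rho from a dual solution alpha of the OCSVM QP.
# alpha (dual coefficients) and K (training kernel matrix) are assumed inputs.
import numpy as np

def recover_rho(alpha, K, nu, m, eps=1e-8):
    # "margin" SVs are those with 0 < alpha_i < 1/(nu*m); average over them
    # for numerical stability (any single one would do in exact arithmetic)
    margin = np.where((alpha > eps) & (alpha < 1.0 / (nu * m) - eps))[0]
    return float(np.mean(K[margin] @ alpha))   # rho = sum_j alpha_j k(x_j, x_i)
```
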
$\nu$-Property
Assume the solution of the primal problem satisfies $\rho \neq 0$. The following statements hold:
- $\nu$ is an upper bound on the fraction of outliers.
- $\nu$ is a lower bound on the fraction of SVs.
- With probability 1, asymptotically, $\nu$ equals both the fraction of SVs and the fraction of outliers (under certain conditions on $P(x)$ and the kernel; a quick empirical check follows below).

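A quick way to see the $\nu$-property in action is to fit a one-class SVM for different values of $\nu$ and count SVs and outliers. A sketch on synthetic data (not the USPS data of the next slide):

```python
# Sketch: empirical check of the nu-property on synthetic 2-D data.
import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.RandomState(0).randn(500, 2)
for nu in (0.5, 0.05):
    clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    frac_sv = len(clf.support_) / len(X)             # fraction of SVs >= nu
    frac_out = float(np.mean(clf.predict(X) == -1))  # fraction of outliers <= nu
    print(f"nu={nu}: SVs={frac_sv:.2f}, outliers={frac_out:.2f}")
```
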
Results – USPS ("0")
[Figure: histograms of SVM output magnitude (x axis) vs. frequency (y axis).]
For $\nu$ = 50%, we get: 50% SVs, 49% outliers.
For $\nu$ = 5%, we get: 6% SVs, 4% outliers.

OCSVM – Shortcomings
- Implicitly assumes that the "negative" data lies around the origin.
- Completely ignores "negative" data, even when some such data exist.

Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
Paul Hayton, Bernhard Schölkopf, Lionel Tarassenko, Paul Anuzis

Intro.
- Jet engines have pass-off tests before they can be delivered to the customer.
- Through vibration tests, an engine's "vibration signature" can be extracted.
- While normal vibration signatures are common, we may be short of abnormal signatures.
- Or even worse, the engine under test may exhibit a type of abnormality which has never been seen before.

Feature Selection
- Vibration gauges are attached to the engine's case.
- The engine under test is slowly accelerated from idle to full speed and decelerated back to idle.
- The vibration signal is then recorded.
- The final feature is calculated as a weighted average of the vibration over 10 different speed ranges, yielding a 10-D vector (see the sketch below).

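A toy sketch of this feature computation (the uniform speed grid, the synthetic data, and the uniform weighting are my assumptions; the paper's exact weighting scheme is not reproduced here):

```python
# Sketch: build a 10-D vibration feature by averaging vibration amplitude
# over 10 speed ranges. Speeds are normalized to [0, 1]; the uniform
# weighting is an assumption, not the paper's actual scheme.
import numpy as np

rng = np.random.RandomState(0)
speeds = np.linspace(0.0, 1.0, 500)        # one recorded idle -> full-speed run
vibration = np.abs(rng.randn(500))         # recorded vibration amplitudes

edges = np.linspace(0.0, 1.0, 11)          # boundaries of the 10 speed ranges
feature = np.array([
    vibration[(speeds >= lo) & (speeds <= hi)].mean()
    for lo, hi in zip(edges[:-1], edges[1:])
])
print(feature.shape)                       # (10,) -> one 10-D vector per engine
```
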
Algorithm
Slightly more general than the regular OCSVM:
- In addition to the "normal" data points $X = \{x_1, \ldots, x_m\}$, we take into account some abnormal points $Z = \{z_1, \ldots, z_t\}$.
- Rather than separating from the origin, we separate from the mean of $Z$.

Primal Form
$\min_{w \in F,\ \xi_i \in \mathbb{R},\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho$
subject to $\left\langle w,\ \Phi(x_i) - \frac{1}{t}\sum_n \Phi(z_n) \right\rangle \geq \rho - \xi_i$, $\xi_i \geq 0$,
and the decision function is
$f(x) = \mathrm{sgn}\left(\left\langle w,\ \Phi(x) - \frac{1}{t}\sum_n \Phi(z_n) \right\rangle - \rho\right)$

Dual Form
$\min_{\alpha \in \mathbb{R}^m} \ \frac{1}{2}\sum_{ij} \alpha_i \alpha_j \left( k(x_i, x_j) + q - q_j - q_i \right)$
where
$q = \frac{1}{t^2}\sum_{np} k(z_n, z_p)$ and $q_j = \frac{1}{t}\sum_n k(x_j, z_n)$,
subject to $0 \leq \alpha_i \leq \frac{1}{\nu m}$, $\sum_i \alpha_i = 1$.

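One way to implement this (a sketch of mine, not the paper's code) is to note that the dual above is just the standard OCSVM dual with the modified kernel $\tilde{k}(a,b) = k(a,b) - q_a - q_b + q$, the inner product of the mean-of-$Z$-shifted features, which can be precomputed and handed to a standard solver:

```python
# Sketch: modified OCSVM via a precomputed, "mean-of-Z shifted" kernel.
# k~(a, b) = <Phi(a) - mu_Z, Phi(b) - mu_Z> = k(a, b) - q_a - q_b + q,
# which matches the dual above. Data and hyperparameters are illustrative.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
X = rng.randn(99, 10)                  # "normal" engines (synthetic stand-in)
Z = rng.randn(5, 10) + 2.0             # "abnormal" engines (synthetic stand-in)
gamma = 0.1

def shifted_kernel(A, B):
    q_A = rbf_kernel(A, Z, gamma=gamma).mean(axis=1)  # q_a = (1/t) sum_n k(a, z_n)
    q_B = rbf_kernel(B, Z, gamma=gamma).mean(axis=1)
    q = rbf_kernel(Z, Z, gamma=gamma).mean()          # q = (1/t^2) sum_np k(z_n, z_p)
    return rbf_kernel(A, B, gamma=gamma) - q_A[:, None] - q_B[None, :] + q

clf = OneClassSVM(kernel="precomputed", nu=0.05)
clf.fit(shifted_kernel(X, X))                 # train on normal engines only
print(clf.predict(shifted_kernel(Z, X)))      # abnormal engines should mostly be -1
```
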
2D Toy Example
[Figure: 2-D toy example of the modified OCSVM.]

Training Data
- 99 normal engines were used as training data.
- 40 normal engines were used as validation data.
- 23 abnormal engines were used as test data.

Standard OCSVM Results
[Figure: standard OCSVM results on the engine data.]

Modified OCSVM Results
[Figure: modified OCSVM results on the engine data.]