Multiresolution Vector Quantized approximation (MVQ)


Overview of Anomaly Detection in Time Series Data
LÊ VĂN QUỐC ANH
Outline
 Introduction
 Anomaly detection approaches
 Classification based
 Nearest Neighbor Based
 Predictive
 Window-Based
 Disk Aware Discord Discovery
 And other approaches
 Comments
 Conclusion
 References
Introduction
Time series data problems:
 Similarity search
 Classification
 Clustering
 Motif discovery
 Anomaly/novelty detection
 Visualization
* [Keogh]
Problem Definition
Anomaly/novelty detection refers to the problem of
finding patterns in data that do not conform to
expected behavior
Problem Definition (cont.)
 Finding discords in large-scale time series
[V. Chandola]
Applications
 Intrusion detection for cyber-security
 Fraud detection for credit cards
 Fault detection in safety critical systems
 Industrial damage detection
 Medical and public health anomaly detection
 Stock market analysis
…
Very simple technique:
Match the data to known patterns
Existing anomaly detection techniques
 Classification based
 Nearest Neighbor Based
 Predictive
 Window-Based
 Disk Aware Discord Discovery
 And other techniques
Classification based approaches
 Learn a model from a set of labeled data instances
and then classify a test instance into one of the
classes using the learnt model
 Operate in two phases:
 training phase: learn a model from training data
 testing phase: classify a test instance as normal or anomalous
 Assumption: A classifier that can distinguish
between normal and anomalous classes can be
learnt in the given feature space.
Classification based approaches (cont.)
 Some techniques:
 Neural Networks based
 Bayesian Networks based
 Support Vector Machines based
 Rule based
Classification based approaches (cont.)
 Advantages:
 can distinguish between instances belonging to
different classes
 testing phase is fast
 Disadvantages:
 have to assign a label to each test instance
 rely on availability of accurate labels for various normal
classes
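As a concrete illustration of the training and testing phases above, here is a minimal sketch of a classification-based detector on fixed-length windows. It assumes scikit-learn; the SVM classifier, the window length of 32, and the synthetic labeled data are illustrative choices, not taken from the slides.

```python
# Minimal sketch of a classification-based detector (assumes scikit-learn).
# Windows labeled 0 (normal) / 1 (anomalous) are purely synthetic here.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Training phase: learn a model from labeled windows.
normal = rng.normal(0.0, 1.0, size=(200, 32))      # normal windows
anomalous = rng.normal(3.0, 1.0, size=(20, 32))    # anomalous windows
X_train = np.vstack([normal, anomalous])
y_train = np.array([0] * len(normal) + [1] * len(anomalous))
clf = SVC(kernel="rbf").fit(X_train, y_train)

# Testing phase: label each test window as normal or anomalous.
X_test = rng.normal(0.0, 1.0, size=(5, 32))
print(clf.predict(X_test))    # 0 = normal, 1 = anomalous
```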
Nearest Neighbor Based
 Assumption: Normal data instances occur in dense
neighborhoods, while anomalies occur far from their
closest neighbors.
 Require a distance measure defined between two data
instances
Nearest Neighbor Based (cont.)
 Advantages:
 purely data driven
 Disadvantages:
 if the data has normal instances that do not have
enough close neighbors, or anomalies that do have
enough close neighbors, the technique fails to label
them correctly
 performance relies heavily on the distance measure
 defining distance measures between instances can be
challenging when the data is complex
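A minimal sketch of a nearest-neighbor anomaly score, using the distance to the k-th nearest neighbor under Euclidean distance; k and the synthetic data are illustrative.

```python
# Minimal nearest-neighbor anomaly score: distance to the k-th nearest
# neighbor under Euclidean distance (k and the data are illustrative).
import numpy as np

def knn_anomaly_scores(windows, k=3):
    """windows: (n, w) array; returns one score per window."""
    # Pairwise Euclidean distances (O(n^2), fine for a sketch).
    diffs = windows[:, None, :] - windows[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)           # ignore self-distance
    # Score = distance to the k-th closest other window.
    return np.sort(dists, axis=1)[:, k - 1]

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 16))
data[0] += 5.0                                # plant one isolated window
scores = knn_anomaly_scores(data)
print(int(np.argmax(scores)))                 # the isolated window scores highest
```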
Predictive techniques
 Forecast the next observation in the time series,
using a statistical model and the time series
observed so far, and compare the forecasted
observation with the actual observation to determine
if an anomaly has occurred.
 Some techniques: Regression, Auto-Regression (AR),
ARMA, ARIMA, SVR (Support Vector Regression)
Predictive techniques (cont.)
 Advantages:
 provide a statistically justifiable solution for anomaly
detection if the assumptions regarding the underlying
data distribution hold true
 Disadvantages:
 rely on the assumption that the data is generated from
a particular distribution
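A minimal sketch of a predictive detector, assuming a plain AR(p) model fit by least squares; the order p, the synthetic series, and the residual threshold are illustrative.

```python
# Minimal predictive detector: fit an AR(p) model by least squares and
# flag points whose one-step forecast error is unusually large.
import numpy as np

def ar_residuals(x, p=3):
    """Return one-step-ahead forecast errors of an AR(p) model."""
    # Design matrix of lagged values: row t holds x[t-1], ..., x[t-p].
    X = np.column_stack([x[p - j - 1:len(x) - j - 1] for j in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 1000)) + rng.normal(0, 0.1, 1000)
series[700] += 3.0                                   # inject an anomaly
resid = ar_residuals(series)
threshold = 4 * resid.std()
print(np.where(np.abs(resid) > threshold)[0] + 3)    # shift by the AR order p
```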
Window-Based
 Extract fixed length (w) windows from a test time
series, and assign an anomaly score to each
window. The per-window scores are then
aggregated to obtain the anomaly score for the test
time series.
 Some proposed techniques:
 HOT SAX
 AWDD
 WAT
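A minimal sketch of the generic window-based scheme (extract fixed-length windows, score each window, aggregate the per-window scores); the per-window score used here is only a placeholder for HOT SAX/AWDD/WAT-style scores.

```python
# Generic window-based scheme: slide a fixed-length window over the test
# series, score each window, then aggregate the per-window scores (max here).
# The per-window score (distance to the mean window) is a placeholder only.
import numpy as np

def sliding_windows(x, w, step=1):
    return np.array([x[i:i + w] for i in range(0, len(x) - w + 1, step)])

def window_based_score(x, w=32):
    wins = sliding_windows(x, w)
    ref = wins.mean(axis=0)                        # placeholder "normal" profile
    per_window = np.linalg.norm(wins - ref, axis=1)
    return per_window.max(), per_window            # aggregate + per-window scores

rng = np.random.default_rng(0)
series = rng.normal(size=500)
series[250:260] += 4.0
total, per_window = window_based_score(series)
print(total, int(per_window.argmax()))             # anomalous region scores highest
```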
HOT SAX
 [Eamonn Keogh, Jessica Lin, Ada Fu]
 Finding the most unusual time series subsequence
 discord
 Improve the BFDD algorithm (Brute Force Discord
Discovery) with heuristic ordering
 Use SAX for discretization
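A minimal sketch of SAX discretization as used in HOT SAX: z-normalize the subsequence, reduce it with PAA, then map each PAA segment to a symbol using breakpoints that split the standard normal distribution into equiprobable regions. The word length and alphabet size below are user-chosen, illustrative parameters.

```python
# Minimal SAX discretization sketch: z-normalize, reduce with PAA, then map
# each PAA segment to a symbol using equiprobable Gaussian breakpoints.
import numpy as np
from scipy.stats import norm

def sax_word(subseq, word_len=4, alphabet_size=4):
    x = (subseq - subseq.mean()) / (subseq.std() + 1e-12)   # z-normalize
    paa = x.reshape(word_len, -1).mean(axis=1)               # PAA (length must divide evenly)
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    symbols = np.searchsorted(breakpoints, paa)               # 0 .. alphabet_size-1
    return "".join(chr(ord("a") + s) for s in symbols)

rng = np.random.default_rng(0)
print(sax_word(rng.normal(size=32)))    # a 4-letter SAX word, e.g. 'cabd'
```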
Brute Force Algorithm
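A sketch of brute force discord discovery: for every subsequence, compute the distance to its nearest non-overlapping neighbor; the discord is the subsequence for which this distance is largest. The window length and test series are illustrative.

```python
# Brute force discord discovery: the discord is the subsequence whose
# distance to its nearest non-overlapping neighbor is largest.
# Requires O(m^2) distance computations for m subsequences.
import numpy as np

def brute_force_discord(x, w):
    subs = np.array([x[i:i + w] for i in range(len(x) - w + 1)])
    best_loc, best_dist = -1, -np.inf
    for i in range(len(subs)):                    # outer loop over all subsequences
        nn = np.inf
        for j in range(len(subs)):                # inner loop: nearest neighbor of i
            if abs(i - j) < w:                    # skip overlapping (self) matches
                continue
            nn = min(nn, np.linalg.norm(subs[i] - subs[j]))
        if nn > best_dist:
            best_dist, best_loc = nn, i
    return best_loc, best_dist

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 16 * np.pi, 500))
series[250:282] += rng.normal(0, 1.0, 32)         # distort one stretch of the sine
print(brute_force_discord(series, w=32))          # the discord overlaps the distorted region
```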
Heuristic Discord Discovery
The two data structures for the Inner and Outer heuristics [Keogh]
AWDD technique
 M. Chuah, F. Fu (2006)
 AWDD - Adaptive Window Based Discord Discovery
 Applied to ECG time series
AWDD technique (cont.)
 Advantages:
 use adaptive rather than fixed windows
 Disadvantages:
 deal only with ECG datasets
WAT technique
 Y. Bu et al. (2007)
 WAT - Wavelet and Augmented Trie
 Applies the Haar wavelet transform and then symbol
word mapping to the raw time series to build a prefix
tree for the Inner and Outer loop heuristics
 can view a subsequence at different resolutions
 the first symbol of each word gives the lowest
resolution for each subsequence
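A minimal sketch of the Haar wavelet transform that WAT applies before symbol mapping; the symbol-word mapping and the augmented trie themselves are not shown.

```python
# Minimal Haar wavelet transform sketch (input length must be a power of 2).
# WAT maps the resulting coefficients to symbols and builds an augmented trie;
# only the transform itself is sketched here.
import numpy as np

def haar_transform(x):
    x = np.asarray(x, dtype=float)
    out = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / 2.0     # coarser approximation
        diff = (x[0::2] - x[1::2]) / 2.0    # detail coefficients at this level
        out.append(diff)
        x = avg
    # Coefficients ordered coarse-to-fine: the overall average comes first,
    # matching "the first symbol gives the lowest resolution".
    return np.concatenate([x] + out[::-1])

print(haar_transform([8, 6, 2, 4, 5, 7, 9, 1]))
```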
WAT technique (cont.)
 Advantages:
 requires 2 parameters (1 of them intuitive)
 better performance than HOT SAX
 Disadvantages:
 assumes the coefficients follow a Gaussian distribution
 assumes that the data reside in main memory
DADD technique
 DADD - Disk Aware Discord Discovery (2008)
[Yankov, Keogh and Rebbapragada]
 Finding unusual time series in terabyte sized
datasets on secondary memory
 Algorithm has two phases:
 Phase 1: a candidate selection phase
 given a threshold r, finds the set of all discords at distance at
least r from their nearest neighbor
 Phase 2: a discord refinement phase
 remove all false discords from the candidate set
A candidate selection phase
procedure [C] = DC Selection(S, r)
in:   S: disk resident data set of time series
      r: discord defining range
out:  C: list of discord candidates

C = {S1}
for i = 2 to |S| do
    isCandidate = true
    for ∀ Cj ∈ C do
        if Dist(Si, Cj) < r then
            C = C \ Cj
            isCandidate = false
        end if
    end for
    if isCandidate then
        C = C ∪ Si
    end if
end for
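A minimal in-memory sketch of the candidate selection phase above; the real algorithm streams S sequentially from disk, and Euclidean distance stands in for Dist.

```python
# In-memory sketch of DADD phase 1 (candidate selection). The real algorithm
# reads S sequentially from disk; Euclidean distance stands in for Dist.
import numpy as np

def dc_selection(S, r):
    """S: (n, w) array of time series; returns indices of discord candidates."""
    C = [0]                                       # start with S1
    for i in range(1, len(S)):
        is_candidate = True
        for j in list(C):                         # iterate over a copy: C shrinks
            if np.linalg.norm(S[i] - S[j]) < r:
                C.remove(j)                       # Cj has a neighbor within r
                is_candidate = False
        if is_candidate:
            C.append(i)
    return C

rng = np.random.default_rng(0)
S = rng.normal(size=(200, 16))
print(len(dc_selection(S, r=5.0)))
```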
A discord refinement phase
procedure [C, C.dist] = DC Refinement(S, C, r)
in:   S: disk resident dataset of time series
      C: discord candidates set
      r: discord defining range
out:  C: list of discords
      C.dist: list of NN distances to the discords

for j = 1 to |C| do
    C.distj = ∞
end for
for ∀ Si ∈ S do
    for ∀ Cj ∈ C do
        if Si == Cj then
            continue
        end if
        d = EarlyAbandon(Si, Cj, C.distj)
        if d < r then
            C = C \ Cj
            C.dist = C.dist \ C.distj
        else
            C.distj = min(C.distj, d)
        end if
    end for
end for
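A matching in-memory sketch of the refinement phase; EarlyAbandon (a distance computation that stops as soon as it exceeds the best distance so far) is replaced by a plain Euclidean distance to keep the sketch short.

```python
# In-memory sketch of DADD phase 2 (refinement). EarlyAbandon is replaced
# by a plain Euclidean distance for brevity.
import numpy as np

def dc_refinement(S, C, r):
    """Returns {candidate index: nearest-neighbor distance} for true discords."""
    dist = {j: np.inf for j in C}
    for i in range(len(S)):
        for j in list(dist):
            if i == j:
                continue
            d = np.linalg.norm(S[i] - S[j])
            if d < r:
                del dist[j]                 # false discord: a neighbor within r exists
            else:
                dist[j] = min(dist[j], d)
    return dist

rng = np.random.default_rng(0)
S = rng.normal(size=(200, 16))
candidates = [0, 1, 2]                      # e.g. the output of the phase-1 sketch
print(dc_refinement(S, candidates, r=5.0))
```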
DADD technique (cont.)
 Advantages:
 requires only two linear scans of the disk with a tiny
buffer of main memory
 very simple to implement
 Disadvantages:
 depends on the choice of the threshold r
Proposed approach
 Using Vector Quantization for
discretization
 Improve BFDD algorithm with ordering heuristic
Using histogram model
 Codebook generation (codebook size s = 16)
 Series transformation: each subsequence is mapped to a string of
codewords from the codebook, e.g. cmdbca, ifajbb, minjja, mainjm,
hldfko, phcako, ...
 Series encoding: each codeword string is encoded as a frequency
histogram over the s codewords, e.g. cmdbca -> 1121000000001000,
ifajbb -> 1200010011000000, ...
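A minimal sketch of this transformation and encoding, assuming the codebook is learned with k-means (scikit-learn) over short, non-overlapping segments; the segment length and the codebook size s = 16 are illustrative.

```python
# Minimal sketch of the VQ histogram model: learn a codebook of s codewords
# over short segments (k-means here), map each segment of a subsequence to
# its nearest codeword, and encode the subsequence as a codeword histogram.
import numpy as np
from sklearn.cluster import KMeans

def segments(x, seg_len):
    return np.array([x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, seg_len)])

rng = np.random.default_rng(0)
series = rng.normal(size=2048)
s, seg_len = 16, 8

codebook = KMeans(n_clusters=s, n_init=10, random_state=0).fit(segments(series, seg_len))

def encode(subseq):
    codes = codebook.predict(segments(subseq, seg_len))   # codeword string (as ints)
    return np.bincount(codes, minlength=s)                # frequency histogram

print(encode(series[:64]))    # counts over the 16 codewords, cf. 1121000000001000
```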
Similarity measure
S_HM(q, t) = 1 / (1 + dis(q, t))

with

dis(q, t) = Σ_{i=1..s} |f_{i,t} − f_{i,q}| / (1 + f_{i,t} + f_{i,q})

where f_{i,t} and f_{i,q} are the frequencies of codeword i (i = 1, 2, ..., s)
in t and q, respectively.
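A direct implementation of the similarity measure above, using the two example histograms from the previous slide.

```python
# Histogram-model similarity as written above: dis sums the absolute
# frequency differences over the s codewords, damped by 1 + f_{i,t} + f_{i,q},
# and S_HM maps the distance into (0, 1].
import numpy as np

def dis(fq, ft):
    fq, ft = np.asarray(fq, float), np.asarray(ft, float)
    return np.sum(np.abs(ft - fq) / (1.0 + ft + fq))

def s_hm(fq, ft):
    return 1.0 / (1.0 + dis(fq, ft))

q = np.array([1, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])   # cmdbca
t = np.array([1, 2, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0])   # ifajbb
print(s_hm(q, q), s_hm(q, t))    # identical histograms give similarity 1.0
```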
Using multiple resolutions
• Codebook (6, 60)
• Codebook (16, 30)
For each resolution
 Start with the lowest resolution and a single group
containing all subsequences
 For each resolution
 groups that contain more than one subsequence are
split based on a threshold r
 Stop when every group contains a single subsequence or
the highest resolution is reached
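A hedged sketch of this grouping step. The slides do not spell out the splitting rule, so the greedy, seed-based split below (a member joins a sub-group if its histogram distance to the sub-group's first member is below r, otherwise it starts a new sub-group) is an assumption.

```python
# Hedged sketch of the multiresolution grouping step. The splitting rule
# (greedy, seed-based, threshold r on an L1 histogram distance) is assumed.
import numpy as np

def split_group(group, hist, r):
    """group: list of subsequence ids; hist: id -> histogram at this resolution."""
    subgroups = []
    for i in group:
        for sg in subgroups:
            if np.abs(hist[i] - hist[sg[0]]).sum() < r:   # assumed distance and rule
                sg.append(i)
                break
        else:
            subgroups.append([i])
    return subgroups

def multiresolution_grouping(hists_per_resolution, r):
    """hists_per_resolution: list (lowest -> highest) of (n, s) histogram arrays."""
    groups = [list(range(len(hists_per_resolution[0])))]   # one group of everything
    for hist in hists_per_resolution:                      # lowest resolution first
        groups = [sg for g in groups
                  for sg in (split_group(g, hist, r) if len(g) > 1 else [g])]
        if all(len(g) == 1 for g in groups):               # stop early
            break
    return groups

rng = np.random.default_rng(0)
res_lo = rng.integers(0, 3, size=(10, 6))    # 10 subsequences, coarse histograms
res_hi = rng.integers(0, 3, size=(10, 16))   # same subsequences, finer histograms
print(multiresolution_grouping([res_lo, res_hi], r=4))
```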
Improve BFDD
 Outer Loop Heuristic:
 groups with the smallest subsequence count are
considered first
 Inner Loop Heuristic:
 when the i-th subsequence is considered in the outer loop,
all subsequences in the same group are considered
first in the Inner Loop
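A sketch of discord search with these two heuristics plus the usual early-abandon break in the inner loop; the groups are assumed to come from the multiresolution grouping step, and Euclidean distance is used.

```python
# Sketch of the ordering heuristics on top of brute force discord search:
# the outer loop visits subsequences from the smallest groups first, and the
# inner loop for subsequence i visits the members of i's group before the rest.
import numpy as np

def heuristic_discord(subs, groups):
    group_of = {i: g for g in groups for i in g}
    outer = [i for g in sorted(groups, key=len) for i in g]    # smallest groups first
    best_loc, best_dist = -1, -np.inf
    for i in outer:
        inner = group_of[i] + [j for j in range(len(subs)) if j not in group_of[i]]
        nn = np.inf
        for j in inner:
            if i == j:
                continue
            nn = min(nn, np.linalg.norm(subs[i] - subs[j]))
            if nn < best_dist:           # early abandon: i cannot be the discord
                break
        if nn > best_dist:
            best_dist, best_loc = nn, i
    return best_loc, best_dist

rng = np.random.default_rng(0)
subs = rng.normal(size=(50, 16))
subs[7] += 5.0                                    # plant a discord
groups = [[7], [i for i in range(50) if i != 7]]  # e.g. output of the grouping step
print(heuristic_discord(subs, groups))            # finds subsequence 7
```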
References
 [1]
E. Keogh, J. Lin, W. Fu. HOT SAX: Efficiently Finding the Most
Unusual Time Series Subsequence. In Proc. of the 5th IEEE International
Conference on Data Mining (ICDM 2005), November 27-30, 2005, pp. 226233.
 [2]
D. Yankov, E. Keogh, U. Rebbapragada, Disk Aware Discord
Discovery: Finding Unusual Time Series in Terabyte Sized Datasets, 2008
 [3]
E. Keogh.Mining Shape and Time Series Databases with Symbolic
Representations. Tutorial of the 13rd ACM Interantional Conference on
Knowledge Discovery and Data Mining (KDD 2007), August 12-15, 2007.
 [4]
J. Lin, E. Keogh, A. Fu, and H. Van Herle, Approximations to Magic:
Finding Unusual Medical Time Series, the 18th IEEE International
Symposium on Computer-Based Medical Systems, pp. 329-334, 2005.
 [5]
M. Chuah and F. Fu, ECG anomaly detection via time series analysis,
Technical Report LU-CSE-07-001, 2007.
38
References (cont.)
 [6]
V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos. A Multiresolution
Symbolic Representation of Time Series. In Proc. of the 21st International
Conference on Data Engineering (ICDE 2005), April 5-8, 2005, pp. 668-679,
2005.
 [7]
V. Chandola, D. Cheboli, and V. Kumar, Detecting Anomalies in a
Time Series Database,Technical Report TR 09-004, 2009.
 [8]
Y. Bu, T-W Leung, A. Fu, E. Keogh, J. Pei, and S. Meshkin, WAT:
Finding Top-K Discords in Time Series Database, in Proc. of the 2007 SIAM
International Conference on Data Mining (SDM'07), Minneapolis, MN, USA,
April 26-28, 2007.
 [9]
Q. Wang, V. Megalooikonomou, A dimensionality reduction technique
for efficient time series similarity analysis, Information Systems 33, 115–
132, 2008.
 [10]
H. B. Kekre Tanuja K. Sarode, Fast Codebook Search Algorithm for
Vector Quantization using Sorting Technique , International Conference on
Advances in Computing, Communication and Control (ICAC3’09), 2009.
39
Thank you!