Tokyo Research Laboratory
Computing Correlation Anomaly Scores
using Stochastic Nearest Neighbors
Tsuyoshi (Tsuyo) Idé,
IBM Research, Tokyo Research Lab.
Spiros Papadimitriou, and Michail Vlachos
IBM T.J. Watson Research Center
| 2007/10/29 | ICDM 2007
© Copyright IBM Corporation 2007
Outline
• Problem statement
• Neighborhood preservation principle
• Stochastic nearest neighbors
• Experimental results and summary
Problem statement
Problem statement (1/2):
We address a task of change analysis between two data sets
Problem 1 (change detection):
Tell whether A and B are different
Problem 2 (change analysis):
Given A and B, tell which variables are responsible for the difference between them
[Figure: data set A and data set B, each consisting of signals x1, x2, …, xN]
Problem statement (2/2):
We assume sensor signals of highly correlated and dynamic natures
Typical application
• Sensor validation (to identify faulty sensors)
Challenges in real data
• dependency between signals
• highly dynamic nature
• heterogeneities
• no supervised information
[Figure: data set A and data set B, each consisting of signals x1, x2, …, xN]
Related work:
Highly dynamic and correlated natures make the problem difficult
Time-series alignment (or DTW) [Berndt 94, Keogh 00, …]
• hard to handle highly dynamic natures
Two-sample test [Friedman 79, Henze 88, Gretton 07, …]
• capable of handling change detection
• but hard to do change analysis
PCA-based approach [Papadimitriou 05, Idé 05, …]
• doesn’t work since there is no stable latent structure in this case
→ see Experiment
Neighborhood preservation principle
Our goal: to compute the anomaly score of each signal
[Figure: signals of data set A (reference data) and data set B (test data) plotted against the time index t, with an anomaly score computed for each variable]
Reducing the problem to graph comparison
[Figure: data set A (reference data) and data set B (test data) are each converted into a dissimilarity graph over the variables, from which an anomaly score is computed per variable]
Problem
Which nodes are responsible for the difference between the two graphs?
Simplest choice of dissimilarity: defined from the correlation coefficient between the i-th and j-th signals
Example dissimilarity matrix:
      x1    x2    ..
x1    0     0.2   ..
x2    0.2   0     ..
..    ..    ..    ..
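As a rough illustration of how a dissimilarity matrix like the one on this slide could be built, the sketch below computes pairwise Pearson correlations and converts them to dissimilarities. The form d_ij = 1 - |c_ij| is an assumption made for this example (the slide's own formula is not preserved here), and the names `pearson`, `dissimilarity_matrix`, and the toy signals are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def dissimilarity_matrix(signals):
    """signals: list of N equal-length sequences.

    Returns an N x N nested list with d_ij = 1 - |c_ij|.
    This functional form is an assumption for illustration only.
    """
    n = len(signals)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            c = pearson(signals[i], signals[j])
            # Clamp at zero to absorb floating-point rounding when |c| is ~1.
            d[i][j] = d[j][i] = max(0.0, 1.0 - abs(c))
    return d

# Toy signals: x2 is a sign-flipped, rescaled copy of x1; x3 is unrelated.
t = [0.1 * k for k in range(200)]
x1 = [math.sin(v) for v in t]
x2 = [-2.0 * math.sin(v) + 0.5 for v in t]
x3 = [math.cos(3.0 * v) for v in t]
D = dissimilarity_matrix([x1, x2, x3])
```

Using the absolute value means that strongly anti-correlated signals are treated as close neighbors, which seems in keeping with the slide's focus on highly correlated pairs.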
Key observation:
Globally unstable, but locally stable
• Global graph structure is unstable due to highly dynamic nature
• Highly correlated pairs are relatively stable even under dynamic fluctuation
→ Neighborhood Preservation Principle
Under normal system operations, “tightness” of highly correlated pairs will be unchanged
[Figure: dissimilarity graphs of data set A (reference data) and data set B (test data)]
High level overview of our approach:
We focus only on local structures of the graph
[Figure: the reference and test dissimilarity graphs are each decomposed into k-neighborhood graphs; the tightness of each neighborhood is evaluated, and the two are compared to give the anomaly score]
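The graph decomposition step can be sketched as follows: for each node, keep only its k least dissimilar nodes. This is a minimal sketch of a plain (hard) k-nearest-neighbor decomposition, not necessarily the authors' exact procedure; the function name `k_neighborhoods` and the toy matrix `d_ref` are hypothetical.

```python
def k_neighborhoods(d, k):
    """For each node i, return the indices of its k nearest nodes
    (excluding i itself), ranked by the dissimilarity d[i][j]."""
    n = len(d)
    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: d[i][j])
        result.append(others[:k])
    return result

# Toy 4-node dissimilarity matrix: nodes 0 and 1 form one tight pair,
# nodes 2 and 3 form another.
d_ref = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
neighbors = k_neighborhoods(d_ref, k=1)
```

Running the same function on the reference and the test matrices gives one small neighborhood graph per node and per data set, which is the local structure the approach compares.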
Stochastic nearest neighbors
The tightness is defined as the sum of coupling probabilities
• Imagine that graph edges are not static but stochastic
• The definition of tightness
• The anomaly score (E-score) is naturally given by*
* In fact, the algorithm has been designed to be symmetric between the two data sets. For details, see the paper.
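To make the idea concrete, the sketch below computes a tightness value as the sum of coupling probabilities over a node's neighborhood and then compares the reference and test values. The exact E-score formula is not preserved on this slide, so the absolute difference and the max-based symmetrization used here are assumptions for illustration; `tightness`, `e_score`, and the toy inputs are hypothetical names and data.

```python
def tightness(p_row, neighborhood):
    """Sum of coupling probabilities p(j|i) over the neighbors j of node i."""
    return sum(p_row[j] for j in neighborhood)

def e_score(p_a, p_b, nbrs_a, nbrs_b):
    """Per-node anomaly score comparing tightness in data sets A and B.

    Assumed form: absolute tightness difference, evaluated once on the
    neighborhoods found in A and once on those found in B, taking the
    larger of the two so that A and B are treated symmetrically.
    """
    scores = []
    for i in range(len(p_a)):
        diff_a = abs(tightness(p_a[i], nbrs_a[i]) - tightness(p_b[i], nbrs_a[i]))
        diff_b = abs(tightness(p_a[i], nbrs_b[i]) - tightness(p_b[i], nbrs_b[i]))
        scores.append(max(diff_a, diff_b))
    return scores

# Toy coupling probabilities: row i holds p(j|i), with p(i|i) = 0.
p_a = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.5, 0.5, 0.0]]
p_b = [[0.0, 0.2, 0.8], [0.6, 0.0, 0.4], [0.8, 0.2, 0.0]]
nbrs_a = [[1], [0], [0]]   # nearest neighbor of each node in A
nbrs_b = [[2], [0], [0]]   # nearest neighbor of each node in B
scores = e_score(p_a, p_b, nbrs_a, nbrs_b)
```

In this toy input, node 0 loses its tight coupling to node 1 in data set B, so it receives the largest score.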
The coupling probability is determined by utilizing the notion of a stochastic neighborhood
→ cf. Hinton-Roweis 03
It can be determined by solving the following problem:
“For a given # of edges, minimize the average dissimilarity within the neighborhood graph”
• Minimum average dissimilarity
• Constant perplexity (H_i: entropy) → constant # of neighboring nodes
• Normalization condition
Solution:
where
This amounts to “softening” neighborhood graphs.
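Minimizing the average dissimilarity under an entropy constraint and a normalization constraint leads, by a standard Lagrange-multiplier argument, to probabilities of exponential (Gibbs) form. The sketch below assumes p(j|i) proportional to exp(-d_ij / sigma_i) and tunes sigma_i by bisection until the perplexity exp(H_i) matches a target; the symbol sigma, the use of d rather than a transformed dissimilarity, and the function name `coupling_probabilities` are assumptions of this example.

```python
import math

def coupling_probabilities(d_row, i, perplexity, iters=100):
    """Return p(j|i) for node i as a list aligned with d_row.

    Assumes p(j|i) is proportional to exp(-d_ij / sigma), with sigma
    chosen so that exp(entropy) equals the target perplexity. The target
    should lie between 1 and the number of other nodes.
    """
    others = [j for j in range(len(d_row)) if j != i]
    d_min = min(d_row[j] for j in others)  # shift for numerical stability

    def probs(sigma):
        w = {j: math.exp(-(d_row[j] - d_min) / sigma) for j in others}
        z = sum(w.values())
        return {j: w[j] / z for j in others}

    def perp(p):
        h = -sum(q * math.log(q) for q in p.values() if q > 0.0)
        return math.exp(h)

    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        mid = math.sqrt(lo * hi)      # bisect on a logarithmic scale
        if perp(probs(mid)) < perplexity:
            lo = mid                  # distribution too peaked: raise sigma
        else:
            hi = mid                  # distribution too flat: lower sigma
    p = probs(math.sqrt(lo * hi))
    return [0.0 if j == i else p[j] for j in range(len(d_row))]

# Node 0 with four other nodes; ask for an effective neighborhood size of 2.
p = coupling_probabilities([0.0, 0.1, 0.2, 0.9, 1.0], i=0, perplexity=2.0)
```

The perplexity plays the role of a soft neighbor count: instead of a hard cut after the k-th neighbor, every node gets a probability that decays with dissimilarity, which is one way to read the "softening" remark on this slide.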
Experimental results and summary
E-score clearly pinpointed faulty automobile sensors, which were very hard to detect with the human eye
→ test data includes 3 faulty sensors due to a mis-wiring error between the x-, y-, and z-axes of an acceleration sensor
→ Faulty ones were clearly identified
[Figure: reference and test data for signals x1, x2, …, x61]
Summary
• We formalized the task of change analysis
• We proposed the neighborhood preservation principle for change analysis
Thanks!