ICDE05-qv.ppt

A Multiresolution Symbolic
Representation of Time Series
Vasileios Megalooikonomou1, Qiang Wang1, Guo Li1,
Christos Faloutsos2
1Temple University, Philadelphia, USA
2Carnegie Mellon University, Pittsburgh, USA
Outline

Background

Methodology

Experimental results

Conclusion
Introduction
Time Sequence:
A sequence (ordered collection) of real values:
X = x1, x2,…, xn
……
Challenges:
• High dimensionality
• Large volume of data
• Similarity metric definition
Introduction
Goal:
To achieve:
• High efficiency
• High accuracy
in similarity searches among time series and
in discovering interesting patterns
Introduction
Similarity metric for time series
• Euclidean Distance:
most common, sensitive to shifts
• Dynamic Time Warping (DTW):
improves accuracy, but is time-consuming: O(n^2)
• Envelope-based DTW:
improves time complexity to O(n)
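The contrast between Euclidean distance and DTW can be sketched on a toy pair of series. This is a minimal illustration, not the authors' code: the function names, the squared point cost, and the two example series are assumptions made here. It uses the classic quadratic dynamic-programming form of DTW, not the envelope-based variant.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic O(n*m) dynamic-programming DTW with squared point costs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return math.sqrt(acc[n][m])


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same bump, shifted one step earlier
print(euclidean(a, b))  # 2.0
print(dtw(a, b))        # 0.0
```

The shift alone produces a Euclidean distance of 2.0, while DTW warps the time axis and reports 0.0, at the price of filling an n-by-m table.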
Introduction
Similarity metric for time series
A more intuitive idea:
two series should be considered similar if
they have enough non-overlapping, time-ordered pairs of subsequences that are
similar (Agrawal et al., VLDB 1995)
Introduction
Dimensionality reduction techniques:
• DFT: Discrete Fourier Transform
• DWT: Discrete Wavelet Transform
• SVD: Singular Value Decomposition
• APCA: Adaptive Piecewise Constant Approximation
• PAA: Piecewise Aggregate Approximation
• SAX: Symbolic Aggregate approXimation
•…
Introduction
Suggested Solution:
Multiresolution Vector Quantized (MVQ) approximation
1) Uses a ‘vocabulary’ of subsequences
2) Takes multiple resolutions into account
3) Unlike wavelets, partially ignores the
ordering of ‘codewords’
4) Exploits prior knowledge about the data
5) Provides a new distance metric
Methodology
A new framework (four steps):
• Create a ‘vocabulary’ of subsequences (codebook)
• Represent time series using codewords
• Utilize multiple resolutions
• Employ a new distance metric
Methodology
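Pipeline: codebook generation (s = 16) -> series transformation -> series encoding
Series transformation (each series becomes a string of codeword symbols a..p):
cmdbca   ifajbb
minjja   mainjm
hldfko   phcako
ogcblp   occblh
lhnkkk   plcacg
kkgjhh   gkgjlp
……
Series encoding (frequency of each of the 16 codewords in the string):
cmdbca -> 1121000000001000
ifajbb -> 1200010011000000
minjja -> 1000000012001100
mainjm -> 1000000011002100
hldfko -> 0001010100110010
phcako -> 1010000100100011
……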
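The encoding step on this slide, counting how often each codeword occurs in a transformed series, can be sketched as follows. This is an illustrative sketch and assumes that the letters a to p label the s = 16 codewords in order; the function name `encode` is hypothetical. Under that assumption the slide's first string, cmdbca, reproduces the slide's first encoding row.

```python
import string

ALPHABET = string.ascii_lowercase[:16]  # codeword labels 'a'..'p', assuming s = 16


def encode(symbols):
    """Count how often each codeword label occurs in a transformed series."""
    counts = [0] * len(ALPHABET)
    for ch in symbols:
        counts[ALPHABET.index(ch)] += 1
    # Digits are concatenated as on the slide; this is only readable for counts below 10.
    return "".join(str(c) for c in counts)


print(encode("cmdbca"))  # 1121000000001000
print(encode("hldfko"))  # 0001010100110010
```

Note that the order of the symbols in the string does not affect the result, which is the sense in which the representation ignores codeword ordering within one resolution.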
Methodology
Creating a ‘vocabulary’
Q: How to create a vocabulary of frequently appearing patterns in subsequences?
A: Use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
GLA produces a codebook based on two conditions:
• Nearest Neighbor Condition (NNC)
• Centroid Condition (CC)
Output:
A codebook with s codewords
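The two conditions can be sketched as an alternating loop: the NNC step assigns every training subsequence to its nearest codeword, and the CC step moves every codeword to the centroid of the subsequences assigned to it. The code below is a minimal sketch, not the authors' implementation; the function names, the evenly spaced initialization, the fixed iteration cap, and the toy training vectors are assumptions made for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(codebook, v):
    """Index of the codeword closest to v (nearest neighbor condition)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], v))


def gla(vectors, s, iterations=20):
    """Alternate the NNC and CC steps, starting from s evenly spaced training vectors."""
    if len(vectors) < s:
        raise ValueError("need at least s training vectors")
    step = len(vectors) // s
    codebook = [list(vectors[k * step]) for k in range(s)]
    for _ in range(iterations):
        # NNC: assign each training vector to its nearest codeword.
        cells = [[] for _ in range(s)]
        for v in vectors:
            cells[nearest(codebook, v)].append(v)
        # CC: move each codeword to the centroid of its cell.
        new_codebook = []
        for k, cell in enumerate(cells):
            if cell:
                dim = len(cell[0])
                new_codebook.append([sum(v[d] for v in cell) / len(cell) for d in range(dim)])
            else:
                new_codebook.append(codebook[k])  # an empty cell keeps its codeword
        if new_codebook == codebook:
            break  # no codeword moved, so both conditions hold
        codebook = new_codebook
    return codebook


training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(gla(training, s=2))
```

On this toy input the loop stops after two passes with one codeword near each group of points.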
Methodology
Representing time series
X = x1, x2,…, xn
is encoded with a new representation
f = (f1,f2,…, fs)
(fi is the frequency of the i-th codeword in X)
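A minimal sketch of how a series X could be turned into the frequency vector f is given below. It assumes the series is cut into non-overlapping windows of the codeword length and that each window is mapped to its nearest codeword under squared Euclidean distance; the slides do not state the windowing, so both choices, the function name `encode_series`, and the toy codebook are assumptions.

```python
def encode_series(x, codebook):
    """Frequency vector f = (f1, ..., fs) of a series x under a given codebook."""
    length = len(codebook[0])  # codeword length
    f = [0] * len(codebook)
    # Assumption: non-overlapping windows; a trailing partial window is dropped.
    for start in range(0, len(x) - length + 1, length):
        window = x[start:start + length]
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(window, codebook[k])),
        )
        f[best] += 1
    return f


codebook = [[0, 0], [1, 1], [0, 1]]   # toy codebook: s = 3 codewords of length 2
x = [0, 0, 1, 1, 0.1, 0.9, 1, 1]      # toy series, n = 8
print(encode_series(x, codebook))     # [1, 2, 1]
```

The series of length 8 is reduced to s = 3 counts, one per codeword.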
Methodology
New distance metric:
The histogram model is used to calculate
similarity at each resolution level:
$$S_{HM}(q, t) = \frac{1}{1 + dis(q, t)}$$
with
$$dis(q, t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$
Methodology
Time series summarization:
• High level information (frequently appearing
patterns) is more useful
• The new representation can provide this kind of
information
Figure: codewords (patterns) 3 and 5 each appear twice
Methodology
Problems of frequency based encoding:
• It cannot record the location of a subsequence
• It is hard to define an appropriate resolution
(codeword length)
• It may lose global information
Methodology
Utilizing multiple resolutions:
Solution: encoding with multiple resolutions
The resolution levels complement one another
Figure: reconstruction of a time series using different resolutions
Methodology
New distance metric:
For all resolution levels a weighted
similarity metric is defined as:
$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} w_i \cdot S_{HM,i}(q, d_j)$$
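The weighted combination over resolution levels can be sketched as below. The sketch assumes that each series is stored as one frequency vector per level, that the per-level similarity is 1 / (1 + dis) with dis summing |f_t - f_q| / (1 + f_t + f_q) over the codewords, and that weights are used as given without normalization. The function names and toy vectors are hypothetical.

```python
def s_hm(q, t):
    """Per-level histogram-model similarity (assumed form, see the lead-in)."""
    d = sum(abs(ft - fq) / (1 + ft + fq) for fq, ft in zip(q, t))
    return 1.0 / (1.0 + d)


def s_hhm(q_levels, d_levels, weights):
    """Weighted sum of the per-level similarities over all c resolution levels."""
    total = 0.0
    for w, q, d in zip(weights, q_levels, d_levels):
        if w:  # a zero weight switches a level off
            total += w * s_hm(q, d)
    return total


# Two levels: the series agree at the coarse level and differ at the finer one.
q_levels = [[1, 0], [2, 0, 0, 0]]
d_levels = [[1, 0], [0, 2, 0, 0]]
print(s_hhm(q_levels, d_levels, [1, 1]))  # 1 + 3/7
print(s_hhm(q_levels, d_levels, [1, 0]))  # 1.0, coarse level only
```

A weight vector with a single 1 reduces the measure to single-level VQ.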
Methodology
Parameters of MVQ
Symbol   Description
X        Original time series, X = x1, x2, …, xn, of length n
X′       Encoded form of the original time series, X′ = f′1, f′2, …, f′s
N        Number of time series in the dataset
n        Length of the original time series
C        Codebook: a set of codewords {c1, …, ck, …, cs}
c        Number of resolution levels
s        Size of the codebook
l        Length of a codeword
Methodology
Parameters of MVQ
• Number of resolution levels
c = log2(n / lmin) + 1 (base 2, since the codeword length halves at each level)
lmin is the minimal codeword length
• Length of codeword (on the i-th level)
l = n / 2^(i-1)
• Size of codebook
Data dependent. However, in practice, small
codebooks can achieve very good results
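As a worked example of the two formulas, the sketch below lists the codeword length at every level. It assumes the logarithm is base 2 (consistent with the codeword length halving per level) and floors the result when n / lmin is not a power of two; the function name and the values n = 64, lmin = 4 are chosen here for illustration only.

```python
import math


def resolution_levels(n, l_min):
    """Codeword length on each level i = 1..c, with c = log2(n / l_min) + 1."""
    c = int(math.log2(n / l_min)) + 1  # floor is an assumption for non-powers of two
    return [n // 2 ** (i - 1) for i in range(1, c + 1)]


print(resolution_levels(64, 4))  # [64, 32, 16, 8, 4]
```

Level 1 uses a single codeword as long as the whole series, and the last level uses codewords of the minimal length.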
Experiments
Datasets

• SYNDATA (control chart data): synthetic
• CAMMOUSE: 3 * 5 sequences obtained using the Camera Mouse Program
• RTT: round-trip time (RTT) measurements from UCR to CMU, with a sending rate of 50 msec, for a day

Experiments
Best Match Searching:
For a given query, time series within the same
class as the query (given our prior knowledge)
form the standard set (std_set(q) ), and the
results found by different approaches (knn(q) )
are compared to this set
The matching accuracy is defined as:
$$Accuracy = \frac{|knn(q) \cap std\_set(q)|}{k} \times 100\%$$
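The matching accuracy is the share of the k returned neighbors that belong to the query's class, expressed as a percentage. A minimal sketch, with a hypothetical function name and made-up series identifiers:

```python
def matching_accuracy(knn, std_set, k):
    """Percentage of the k retrieved series that belong to the query's class."""
    return 100.0 * len(set(knn) & set(std_set)) / k


retrieved = ["s3", "s7", "s9", "s12"]      # knn(q) with k = 4
same_class = {"s3", "s9", "s20", "s21"}    # std_set(q)
print(matching_accuracy(retrieved, same_class, k=4))  # 50.0
```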
Experiments
Best Match Searching
Method            Weight vector   SYNDATA accuracy   CAMMOUSE accuracy
Single level VQ   [1 0 0 0 0]     0.55               0.56
Single level VQ   [0 1 0 0 0]     0.70               0.60
Single level VQ   [0 0 1 0 0]     0.65               0.44
Single level VQ   [0 0 0 1 0]     0.48               0.56
Single level VQ   [0 0 0 0 1]     0.46               0.60
MVQ               [1 1 1 1 1]     0.83               0.83
Euclidean         -               0.51               0.58
Experiments
Best Match Searching
Figure: precision-recall for different methods, (a) on the SYNDATA dataset, (b) on the CAMMOUSE dataset
Experiments
Clustering experiments
Given two clusterings, G = G1, G2, …, Gk (the true
clusters) and A = A1, A2, …, Ak (the clustering produced by a
given method), the clustering accuracy is evaluated
with the cluster similarity, defined as:
$$Sim(G, A) = \frac{\sum_i \max_j Sim(G_i, A_j)}{k}$$
with
$$Sim(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$
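The cluster similarity can be sketched as follows, assuming Sim(G, A) averages, over the k true clusters, the best pairwise score 2|Gi ∩ Aj| / (|Gi| + |Aj|) against any produced cluster. The function names and the toy clusterings are hypothetical.

```python
def sim_pair(g, a):
    """Overlap score of one true cluster and one produced cluster."""
    return 2 * len(g & a) / (len(g) + len(a))


def cluster_similarity(G, A):
    """Average, over the true clusters, of the best match among the produced clusters."""
    return sum(max(sim_pair(g, a) for a in A) for g in G) / len(G)


G = [{1, 2, 3}, {4, 5, 6}]   # true clusters
A = [{1, 2}, {3, 4, 5, 6}]   # produced clusters: item 3 is misplaced
print(cluster_similarity(G, A))  # (0.8 + 6/7) / 2
print(cluster_similarity(G, G))  # 1.0
```

The measure is not symmetric in general, since the maximum is taken from the point of view of the true clusters.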
Experiments
Clustering experiments
SYNDATA
Method            Weight vector   Accuracy
Single level VQ   [1 0 0 0 0]     0.55
Single level VQ   [0 1 0 0 0]     0.52
Single level VQ   [0 0 1 0 0]     0.57
Single level VQ   [0 0 0 1 0]     0.51
Single level VQ   [0 0 0 0 1]     0.49
MVQ               [1 1 1 1 1]     0.82
DFT               -               0.67
SAX               -               0.65
DTW               -               0.80
Euclidean         -               0.55

RTT
Method            Weight vector   Accuracy
Single level VQ   [1 0 0 0 0]     0.69
Single level VQ   [0 1 0 0 0]     0.71
Single level VQ   [0 0 1 0 0]     0.63
Single level VQ   [0 0 0 1 0]     0.80
Single level VQ   [0 0 0 0 1]     0.79
MVQ               [0 0 0 1 1]     0.81
DFT               -               0.54
SAX               -               0.54
DTW               -               0.62
Euclidean         -               0.50
Experiments
Summarization (SYNDATA)
Typical series:
Experiments
First Level
Second Level
Conclusion
• A new symbolic representation of time series
• Utilizes multiple resolutions
• A more meaningful similarity metric
• Improved efficiency due to the dimensionality
reduction
• Nice summarization of time series
• Uses prior knowledge (training process)