ICDE05-qv.ppt
A Multiresolution Symbolic
Representation of Time Series
Vasileios Megalooikonomou (1), Qiang Wang (1), Guo Li (1), Christos Faloutsos (2)
(1) Temple University, Philadelphia, USA
(2) Carnegie Mellon University, Pittsburgh, USA
Outline
Background
Methodology
Experimental results
Conclusion
Introduction
Time Sequence:
A sequence (ordered collection) of real values:
X = x1, x2,…, xn
Challenges:
• High dimensionality
• Large volume of data
• Similarity metric definition
Introduction
Goal:
To achieve:
• High efficiency
• High accuracy
in similarity searches among time series and
in discovering interesting patterns
Introduction
Similarity metric for time series
• Euclidean Distance:
most common, but sensitive to shifts
• Dynamic Time Warping (DTW):
improves accuracy, but is time-consuming: O(n^2)
• Envelope-based DTW:
improves the time complexity to O(n)
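To make the contrast between the first two metrics concrete, here is a minimal sketch, assuming a squared point-to-point cost and no warping-window constraint (neither is specified on the slide). The function names `euclidean` and `dtw` are illustrative only.

```python
import math


def euclidean(x, y):
    """Point-by-point Euclidean distance; both series must have equal length."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw(x, y):
    """Textbook dynamic-programming DTW, O(len(x) * len(y)) time and space.

    Assumption: squared difference as the local cost, no warping window.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # the same bump, shifted one step earlier
print(euclidean(a, b))  # 2.0
print(dtw(a, b))        # 0.0
```

The shifted bump costs the Euclidean distance 2.0, while DTW aligns the two bumps and reports 0.0, at the price of filling a quadratic table.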
Introduction
Similarity metric for time series
A more intuitive idea:
two series should be considered similar if
they have enough non-overlapping, time-ordered
pairs of similar subsequences (Agrawal et al., VLDB 1995)
Introduction
Dimensionality reduction techniques:
• DFT: Discrete Fourier Transform
• DWT: Discrete Wavelet Transform
• SVD: Singular Value Decomposition
• APCA: Adaptive Piecewise Constant Approximation
• PAA: Piecewise Aggregate Approximation
• SAX: Symbolic Aggregate approXimation
•…
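As an illustration of what such a reduction does, the sketch below implements PAA, the simplest entry in the list: the series is cut into equal-width segments and each segment is replaced by its mean. The function name `paa` and the requirement that the length be divisible by the number of segments are simplifying assumptions.

```python
def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment.

    Assumption: len(series) is an exact multiple of `segments`.
    """
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = n // segments
    return [sum(series[i * width:(i + 1) * width]) / width for i in range(segments)]


reduced = paa([1, 3, 2, 4, 6, 8, 5, 7], 4)
print(reduced)  # [2.0, 3.0, 7.0, 6.0]
```

Eight values are reduced to four, so any distance computed on the reduced form touches half as many dimensions.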
Introduction
Suggested Solution:
Multiresolution Vector Quantized (MVQ) approximation
1) Uses a ‘vocabulary’ of subsequences
2) Takes multiple resolutions into account
3) Unlike wavelets, partially ignores the
ordering of ‘codewords’
4) Exploits prior knowledge about the data
5) Provides a new distance metric
Methodology
A new framework (four steps):
• Create a ‘vocabulary’ of subsequences (codebook)
• Represent time series using codewords
• Utilize multiple resolutions
• Employ a new distance metric
Methodology
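[Figure: codebook generation (s = 16), series transformation, and series encoding. Each series is transformed into a string over the 16 codeword symbols a to p, and each string is then encoded as a 16-bin codeword-frequency vector.]
Symbol string   Frequency vector (bins a to p)
cmdbca          1121000000001000
ifajbb          1200010011000000
minjja          1000000012001100
mainjm          1000000011002100
hldfko          0001010100110010
phcako          1010000100100011
……
Further symbol strings in the figure: ogcblp, occblh, lhnkkk, plcacg, kkgjhh, gkgjlp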
Methodology
Creating a ‘vocabulary’
Q: How to create a vocabulary of frequently appearing patterns in subsequences?
A: Use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
GLA produces a codebook based on two conditions:
• Nearest Neighbor Condition (NNC)
• Centroid Condition (CC)
Output:
A codebook with s codewords
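The two conditions can be read as the two alternating steps of a Lloyd-style iteration. The following is a minimal sketch, not the authors' implementation: the helper names, the non-overlapping windowing, the evenly spaced initialization, and the fixed iteration cap are all assumptions made for illustration.

```python
def split_subsequences(series, length):
    """Cut a series into consecutive, non-overlapping windows (an assumption)."""
    return [tuple(series[i:i + length])
            for i in range(0, len(series) - length + 1, length)]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest(codebook, vec):
    """Index of the codeword closest to vec (ties go to the lower index)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k], vec))


def gla(vectors, s, iterations=20):
    """Lloyd-style codebook training with s codewords."""
    # Hypothetical initialization: evenly spaced training vectors.
    codebook = [vectors[(k * len(vectors)) // s] for k in range(s)]
    for _ in range(iterations):
        # Nearest Neighbor Condition: assign each vector to its closest codeword.
        cells = [[] for _ in range(s)]
        for v in vectors:
            cells[nearest(codebook, v)].append(v)
        # Centroid Condition: each codeword becomes the mean of its cell.
        updated = []
        for k, cell in enumerate(cells):
            if cell:
                updated.append(tuple(sum(col) / len(cell) for col in zip(*cell)))
            else:
                updated.append(codebook[k])  # keep an empty cell's codeword
        if updated == codebook:
            break
        codebook = updated
    return codebook


training = split_subsequences([0, 0, 1, 1, 10, 10, 11, 11], 2)
codebook = gla(training, s=2)
print(codebook)  # [(0.5, 0.5), (10.5, 10.5)]
```

With four training windows and s = 2, the iteration settles on one codeword per group of similar windows. Like any Lloyd-style method, it can stop in a local optimum on less well-separated data.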
Methodology
Representing time series
X = x1, x2, …, xn
is encoded with a new representation
f = (f1, f2, …, fs)
(fi is the frequency of the i-th codeword in X)
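A minimal sketch of this encoding step follows, assuming the series is cut into non-overlapping windows of the codeword length and each window is mapped to its nearest codeword under squared Euclidean distance. The function name `encode` and the toy codebook are hypothetical.

```python
def encode(series, codebook):
    """Return f = (f1, ..., fs): how often each codeword is the nearest match.

    Assumption: non-overlapping windows whose length equals the codeword length.
    """
    length = len(codebook[0])
    freq = [0] * len(codebook)
    for start in range(0, len(series) - length + 1, length):
        window = series[start:start + length]
        best = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(window, codebook[k])),
        )
        freq[best] += 1
    return freq


toy_codebook = [(0, 0), (5, 5), (0, 5)]  # s = 3 codewords of length 2
f = encode([0, 0, 5, 5, 0, 1, 5, 4], toy_codebook)
print(f)  # [2, 2, 0]
```

The eight-point series collapses to three counts: the first two codewords each match two windows and the third matches none.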
Methodology
New distance metric:
The histogram model is used to calculate
similarity at each resolution level:
$$S_{HM}(q, t) = \frac{1}{1 + \mathrm{dis}(q, t)}$$
with
$$\mathrm{dis}(q, t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$
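The histogram model can be sketched in a few lines, reading dis(q, t) as the sum over codewords of the absolute frequency difference divided by one plus the two frequencies. The function names are illustrative, and the two frequency vectors are the 16-bin encodings of the strings "cmdbca" and "minjja" from the encoding example.

```python
def hm_distance(fq, ft):
    """dis(q, t): normalized absolute difference of codeword frequencies."""
    if len(fq) != len(ft):
        raise ValueError("histograms must use the same codebook size")
    return sum(abs(t - q) / (1 + t + q) for q, t in zip(fq, ft))


def hm_similarity(fq, ft):
    """S_HM(q, t) = 1 / (1 + dis(q, t)); equals 1 for identical histograms."""
    return 1.0 / (1.0 + hm_distance(fq, ft))


q = [int(ch) for ch in "1121000000001000"]  # "cmdbca"
t = [int(ch) for ch in "1000000012001100"]  # "minjja"
print(round(hm_distance(q, t), 4))    # 3.3333
print(round(hm_similarity(q, t), 4))  # 0.2308
print(hm_similarity(q, q))            # 1.0
```

The 1 in each denominator keeps a term finite when a codeword is absent from both series, and the similarity stays in the interval (0, 1].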
Methodology
Time series summarization:
• High level information (frequently appearing
patterns) is more useful
• The new representation can provide this kind of
information
Example: codewords (patterns) 3 and 5 both show up 2 times
Methodology
Problems of frequency based encoding:
• It cannot record the location of a subsequence
• It is hard to define an appropriate resolution
(codeword length)
• It may lose global information
Methodology
Utilizing multiple resolutions:
Solution: encoding with multiple resolutions
The resolution levels complement one another
[Figure: reconstruction of time series using different resolutions]
Methodology
New distance metric:
For all resolution levels, a weighted
similarity metric is defined as:
$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} w_i \cdot S_{HM_i}(q, d_j)$$
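A small sketch of the weighted combination, assuming each series is already encoded as one frequency histogram per resolution level. The function names and the toy histograms are hypothetical, and the weights are used as given, without normalization.

```python
def hm_similarity(fq, ft):
    """Per-level histogram-model similarity, 1 / (1 + dis)."""
    dis = sum(abs(t - q) / (1 + t + q) for q, t in zip(fq, ft))
    return 1.0 / (1.0 + dis)


def hhm_similarity(levels_q, levels_d, weights):
    """S_HHM: weighted sum of the per-level similarities.

    levels_q and levels_d hold one frequency histogram per resolution level.
    """
    if not (len(levels_q) == len(levels_d) == len(weights)):
        raise ValueError("need one histogram and one weight per level")
    return sum(w * hm_similarity(fq, fd)
               for w, fq, fd in zip(weights, levels_q, levels_d))


query = [[1, 0], [2, 1, 1, 0]]  # toy histograms for c = 2 levels
other = [[0, 1], [2, 1, 1, 0]]  # differs at level 1, identical at level 2
print(hhm_similarity(query, other, [1, 1]))  # 1.5
print(hhm_similarity(query, other, [1, 0]))  # 0.5
```

A weight vector with a single 1 reduces the measure to single-level VQ at that level, which is how the weight vectors in the experiments can be read.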
Methodology
Parameters of MVQ
Symbol   Meaning
X        Original time series, X = x1, x2, …, xn of length n
X′       Encoded form of the original time series, X′ = f′1, f′2, …, f′s
N        Number of time series in the dataset
n        Length of original time series
C        Codebook: a set of codewords {c1, …, ck, …, cs}
c        Number of resolution levels
s        Size of codebook
l        Length of codeword
Methodology
Parameters of MVQ
• Number of resolution levels:
$c = \log_2(n / l_{min}) + 1$
where $l_{min}$ is the minimal codeword length
• Length of codeword (on the i-th level):
$l = n / 2^{i-1}$
• Size of codebook:
Data dependent. However, in practice, small
codebooks can achieve very good results
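The two formulas fit together as follows: the codeword length halves from one level to the next, so the logarithm is taken base 2. The sketch below assumes that n / lmin is a power of two; the function names and the example values n = 64 and lmin = 4 are illustrative.

```python
import math


def resolution_levels(n, l_min):
    """c = log2(n / l_min) + 1, assuming n / l_min is a power of two."""
    return int(math.log2(n / l_min)) + 1


def codeword_lengths(n, l_min):
    """Codeword length on level i (1-based) is n / 2**(i - 1)."""
    c = resolution_levels(n, l_min)
    return [n // 2 ** (i - 1) for i in range(1, c + 1)]


print(resolution_levels(64, 4))  # 5
print(codeword_lengths(64, 4))   # [64, 32, 16, 8, 4]
```

Level 1 covers the whole series with a single codeword, and the last level reaches the minimal codeword length.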
Experiments
Datasets
SYNDATA (control chart data): synthetic
CAMMOUSE: 3 × 5 sequences obtained using the
Camera Mouse Program
RTT: round-trip time (RTT) measurements from UCR to CMU,
sent at 50 msec intervals for a day
Experiments
Best Match Searching:
For a given query q, the time series within the same
class as the query (given our prior knowledge)
form the standard set, std_set(q), and the
results found by the different approaches, knn(q),
are compared to this set
The matching accuracy is defined as:
$$\mathrm{Accuracy} = \frac{|\mathrm{knn}(q) \cap \mathrm{std\_set}(q)|}{k} \times 100\%$$
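The accuracy measure is a set intersection divided by k. A minimal sketch, with hypothetical series identifiers and an illustrative function name:

```python
def matching_accuracy(knn, std_set):
    """Percentage of the k retrieved series that belong to the query's class."""
    k = len(knn)
    if k == 0:
        raise ValueError("knn must contain at least one result")
    return 100.0 * len(set(knn) & set(std_set)) / k


retrieved = ["s1", "s2", "s3", "s4"]  # knn(q) with k = 4
same_class = {"s1", "s3", "s9"}       # std_set(q)
print(matching_accuracy(retrieved, same_class))  # 50.0
```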
Experiments
Best Match Searching
Accuracy by method and weight vector:
Method            Weight vector   SYNDATA   CAMMOUSE
Single-level VQ   [1 0 0 0 0]     0.55      0.56
Single-level VQ   [0 1 0 0 0]     0.70      0.60
Single-level VQ   [0 0 1 0 0]     0.65      0.44
Single-level VQ   [0 0 0 1 0]     0.48      0.56
Single-level VQ   [0 0 0 0 1]     0.46      0.60
MVQ               [1 1 1 1 1]     0.83      0.83
Euclidean         (none)          0.51      0.58
Experiments
Best Match Searching
[Figure: precision-recall for different methods
(a) on SYNDATA dataset (b) on CAMMOUSE dataset]
Experiments
Clustering experiments
Given two clusterings, G = G1, G2, …, Gk (the true
clusters) and A = A1, A2, …, Ak (the clustering result of a
certain method), the clustering accuracy is evaluated
with the cluster similarity defined as:
$$\mathrm{Sim}(G, A) = \frac{\sum_{i} \max_{j} \mathrm{Sim}(G_i, A_j)}{k}$$
with
$$\mathrm{Sim}(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$
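The cluster similarity can be sketched directly from the two formulas, assuming clusters are represented as Python sets of series identifiers and that k is the number of true clusters. The function names and the toy clusterings are illustrative.

```python
def pair_similarity(g, a):
    """Sim(Gi, Aj) = 2 |Gi ∩ Aj| / (|Gi| + |Aj|)."""
    return 2 * len(g & a) / (len(g) + len(a))


def cluster_similarity(true_clusters, found_clusters):
    """Sim(G, A): average over true clusters of the best-matching found cluster."""
    k = len(true_clusters)
    return sum(max(pair_similarity(g, a) for a in found_clusters)
               for g in true_clusters) / k


G = [{1, 2, 3}, {4, 5, 6}]
A = [{1, 2}, {3, 4, 5, 6}]  # item 3 was placed in the wrong cluster
print(round(cluster_similarity(G, A), 4))  # 0.8286
print(cluster_similarity(G, G))            # 1.0
```

Note that the measure is not symmetric: swapping the roles of G and A can change the value, so the true clustering goes first.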
Experiments
Clustering experiments
SYNDATA
Method            Weight vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.55
Single-level VQ   [0 1 0 0 0]     0.52
Single-level VQ   [0 0 1 0 0]     0.57
Single-level VQ   [0 0 0 1 0]     0.51
Single-level VQ   [0 0 0 0 1]     0.49
MVQ               [1 1 1 1 1]     0.82
DFT               (none)          0.67
SAX               (none)          0.65
DTW               (none)          0.80
Euclidean         (none)          0.55
RTT
Method            Weight vector   Accuracy
Single-level VQ   [1 0 0 0 0]     0.69
Single-level VQ   [0 1 0 0 0]     0.71
Single-level VQ   [0 0 1 0 0]     0.63
Single-level VQ   [0 0 0 1 0]     0.80
Single-level VQ   [0 0 0 0 1]     0.79
MVQ               [0 0 0 1 1]     0.81
DFT               (none)          0.54
SAX               (none)          0.54
DTW               (none)          0.62
Euclidean         (none)          0.50
Experiments
Summarization (SYNDATA)
[Figure: typical series]
Experiments
[Figure panels: First Level, Second Level]
Conclusion
• A new symbolic representation of time series
• Utilizes multiple resolutions
• A more meaningful similarity metric
• Improved efficiency due to the dimensionality
reduction
• Nice summarization of time series
• Uses prior knowledge (training process)