The Landmark Model: An Instance Selection Method for Time Series Data
C.-S. Perng, S. R. Zhang, and D. S. Parker
Instance Selection and Construction for Data Mining,
Chapter 7, pp. 113-130
Cho, Dong-Yeon
Introduction
Complexity
Patterns: continuous time series segments with
particular features
The reflection of events in time series is better
represented by patterns.
The complexity of processing patterns
The number of all possible segments for a time series of length N is N(N+1)/2.
A simple inspection of each of these segments takes O(N^3) time.
Good instance selection algorithms are especially
helpful here, since they can greatly reduce complexity
by reducing the volume of data.
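The segment count can be checked with a minimal sketch. The function names `count_segments` and `enumerate_segments` are illustrative and do not come from the chapter. Since each of the N(N+1)/2 segments has length up to N, inspecting all of them point by point gives the cubic cost.

```python
def count_segments(n: int) -> int:
    """Number of contiguous, non-empty segments of a series of length n."""
    return n * (n + 1) // 2


def enumerate_segments(series):
    """Yield every contiguous segment (start inclusive, end exclusive)."""
    n = len(series)
    for start in range(n):
        for end in range(start + 1, n + 1):
            yield series[start:end]


# Brute-force enumeration agrees with the closed form for a short series.
example = [1, 2, 3, 4, 3]
assert len(list(enumerate_segments(example))) == count_segments(len(example))
```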
Similarity Model
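Euclidean distance does not match human intuition.
Example: 1,2,3,4,3 and 3,4,5,6,5 have the same shape (the second is the first shifted up by 2), yet their Euclidean distance is sqrt(20), about 4.47.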
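A short sketch of the example, assuming plain point-wise Euclidean distance. The helper name `euclidean` is illustrative. It shows that the two sequences are far apart as raw data but identical once the constant offset is removed.

```python
import math


def euclidean(a, b):
    """Point-wise Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


s1 = [1, 2, 3, 4, 3]
s2 = [3, 4, 5, 6, 5]

raw_distance = euclidean(s1, s2)            # every point differs by 2
shifted = [v - 2 for v in s2]               # undo the constant offset
shifted_distance = euclidean(s1, shifted)   # the shapes coincide exactly
```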
Previous works
None of the previously proposed techniques supports a similarity model that can both capture the similarity and support efficient pattern querying of time series.
Pattern Representation
Two formats for temporal association rules to verify the cause-effect relation
Forward association: C1,…,Cn → E1,…,Em
Backward association: C1,…,Cn ← E1,…,Em
Association rules can be either formulated as hypotheses and verified with data, or discovered by a data mining process.
It is still not clear what kind of segments can represent events.
What is the basic vocabulary for spelling association rules?
Noise Removal and Data Smoothing
Commonly-used smoothing techniques, such as moving
averages, often lag or miss the most significant peaks and
bottoms.
These peaks and bottoms can be very meaningful, and smoothing or removing them can lose a great deal of information.
Little previous work takes smoothing as an integral part of
the process of pattern definition, index construction, and
query processing.
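The effect of a moving average on a peak can be illustrated with a toy series. This is a sketch only: the trailing window and the name `moving_average` are assumptions made for the illustration, not something specified in the chapter.

```python
def moving_average(ys, window):
    """Trailing moving average; the first window-1 points have no output."""
    return [
        sum(ys[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(ys))
    ]


ys = [1, 1, 1, 10, 1, 1, 1]          # a single sharp peak of height 10
smoothed = moving_average(ys, 3)     # the peak is flattened and spread out
```

In this toy case the peak of height 10 never appears in the smoothed output; it is replaced by a plateau of height 4 that persists after the original peak has passed.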
The Landmark Data Model and Similarity Model
The Landmark Concept
Episodic memory: humans and animals depend on landmarks in organizing their spatial memory.
Landmarks: (times, events)
Using landmarks instead of the raw data for processing
A point is an N-th order landmark of a curve if the N-th order derivative is 0 at that point.
First-order landmarks are local maxima and local minima; second-order landmarks are inflection points.
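For a sampled series, first-order landmarks can be approximated by looking for sign changes in consecutive differences. The sketch below makes several assumptions that are not in the slides: samples sit at integer times 0, 1, 2, ..., the two endpoints are always kept, and flat plateaus are ignored. The name `first_order_landmarks` is illustrative.

```python
def first_order_landmarks(ys):
    """Return (index, value) pairs at the endpoints and at local extrema."""
    n = len(ys)
    if n == 0:
        return []
    landmarks = [(0, ys[0])]                 # keep the first point
    for i in range(1, n - 1):
        left = ys[i] - ys[i - 1]             # slope coming into point i
        right = ys[i + 1] - ys[i]            # slope leaving point i
        if left * right < 0:                 # sign change: peak or bottom
            landmarks.append((i, ys[i]))
    if n > 1:
        landmarks.append((n - 1, ys[-1]))    # keep the last point
    return landmarks


landmarks = first_order_landmarks([1, 2, 3, 4, 3])
```

Here five raw points reduce to three landmarks: the start, the peak at index 3, and the end.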
Tradeoff
The more different types of landmarks in use, the more accurately a time series will be represented.
Using fewer landmarks will result in storage savings and
smaller index trees.
Stock market data
Almost half of the record
The normalized error is reasonably small when the curve is
reconstructed from the landmarks.
The more volatile the time series, the less significant the
higher-order landmarks.
Smoothing
Minimal Distance/Percentage Principle (MDPP)
Given a minimal distance D and a minimal percentage P, remove landmarks $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ if

$$x_{i+1} - x_i < D \quad \text{and} \quad \frac{|y_{i+1} - y_i|}{(|y_i| + |y_{i+1}|)/2} < P$$
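A minimal sketch of the MDPP rule follows. It assumes strict inequalities (time gap below D and relative amplitude change below P) and a single left-to-right pass in which a removed pair is skipped entirely. It also treats a pair of zero-valued landmarks as having zero relative change. The name `mdpp` and these processing details are assumptions for illustration, not the authors' implementation.

```python
def mdpp(landmarks, D, P):
    """One left-to-right pass of the MDPP rule over (x, y) landmarks."""
    kept = []
    i = 0
    while i < len(landmarks):
        if i + 1 < len(landmarks):
            (x0, y0), (x1, y1) = landmarks[i], landmarks[i + 1]
            mean_magnitude = (abs(y0) + abs(y1)) / 2
            # Relative amplitude change; defined as 0 when both values are 0.
            relative = abs(y1 - y0) / mean_magnitude if mean_magnitude > 0 else 0.0
            if x1 - x0 < D and relative < P:
                i += 2          # drop both landmarks of the insignificant pair
                continue
        kept.append(landmarks[i])
        i += 1
    return kept


# A peak, a tiny dip one step later, and a second peak: the wiggle is removed.
raw = [(0, 10.0), (5, 20.0), (6, 19.8), (7, 20.0), (12, 10.0)]
smoothed = mdpp(raw, D=3, P=0.05)
```

With D = 3 and P = 5%, the pair at times 5 and 6 is close in time and differs by about 1% in amplitude, so both landmarks are removed and a single peak remains.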
The effect of the MDPP
Normalized error generated by the MDPP and DFT
Transformations
Six kinds of transformations
Shifting: SHk(f) such that SHk(f(t))=f(t)+k where k is a constant.
Uniform Amplitude Scaling: UASk(f) such that UASk(f(t))=kf(t)
where k is a constant.
Uniform Time Scaling: UTSk(f) such that UTSk(f(t))=f(kt)
where k is a positive constant.
Uniform Bi-scaling: UBSk(f) such that UBSk(f(t))=kf(t/k) where
k is a positive constant.
Time Warping: TWg(f) such that TWg(f(t))=f(g(t)) where g is positive and monotonically increasing.
Non-uniform Amplitude Scaling: NASg(f) such that
NASg(f(t))=g(t) where for every t, g´(t)=0 if and only if f´(t)=0.
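The more transformations included in a similarity model, the more powerful the similarity model.
These transformations can be composed to form new transformations.
The composition order is flexible: $F_u \circ G_v = G_{u'} \circ F_{v'}$ (the parameters may change when the order is swapped).
The composition is idempotent: $F_w = F_u \circ F_v$ (composing two transformations of the same kind gives another of that kind).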
Two time series are defined to be similar if they differ
only by a transform.
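The first five transformations can be written as higher-order functions on a curve f. This is a sketch with illustrative function names. Non-uniform amplitude scaling is left out because it is defined by a constraint on g rather than by a formula in f. The two checks at the end illustrate the composition properties for shifting and uniform amplitude scaling only.

```python
import math


def shift(f, k):              # SH_k(f)(t) = f(t) + k
    return lambda t: f(t) + k


def amplitude_scale(f, k):    # UAS_k(f)(t) = k * f(t)
    return lambda t: k * f(t)


def time_scale(f, k):         # UTS_k(f)(t) = f(k * t), k > 0
    return lambda t: f(k * t)


def bi_scale(f, k):           # UBS_k(f)(t) = k * f(t / k), k > 0
    return lambda t: k * f(t / k)


def time_warp(f, g):          # TW_g(f)(t) = f(g(t)), g positive and increasing
    return lambda t: f(g(t))


f = math.sin
t = 0.7

# Two shifts compose into a single shift: SH_3(SH_2(f)) = SH_5(f).
two_shifts = shift(shift(f, 2), 3)(t)
one_shift = shift(f, 5)(t)

# Swapping the order changes a parameter: UAS_3(SH_2(f)) = SH_6(UAS_3(f)).
scale_after_shift = amplitude_scale(shift(f, 2), 3)(t)
shift_after_scale = shift(amplitude_scale(f, 3), 6)(t)
```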
Landmark Similarity
Dissimilarity measure
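Given two sequences of landmarks $L = \langle L_1, \ldots, L_n \rangle$ and $L' = \langle L'_1, \ldots, L'_n \rangle$ where $L_i = (x_i, y_i)$ and $L'_i = (x'_i, y'_i)$, the distance between the k-th landmarks is defined by $\Delta_k(L, L') = (\delta_k^{time}(L, L'), \delta_k^{amp}(L, L'))$ where

$$\delta_k^{time}(L, L') = \begin{cases} \dfrac{|(x_k - x_{k-1}) - (x'_k - x'_{k-1})|}{(|x_k - x_{k-1}| + |x'_k - x'_{k-1}|)/2} & \text{if } 1 < k \le n \\ 0 & \text{otherwise} \end{cases}$$

$$\delta_k^{amp}(L, L') = \begin{cases} 0 & \text{if } y_k = y'_k \\ \dfrac{|y_k - y'_k|}{(|y_k| + |y'_k|)/2} & \text{otherwise} \end{cases}$$

The distance between the two sequences is $\Delta(L, L') = (\delta^{time}(L, L'), \delta^{amp}(L, L')) = (\delta^{time}, \delta^{amp})$.
We define $(\delta^{time}, \delta^{amp}) < (\epsilon^{time}, \epsilon^{amp})$ if $\delta^{time} < \epsilon^{time}$ and $\delta^{amp} < \epsilon^{amp}$.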
A landmark similarity measure is a binary relation on time series segments defined by a 5-tuple $LSM = \langle D, P, T, \epsilon^{time}, \epsilon^{amp} \rangle$.
Given two time series sequences s1 and s2, let L1 and L2 be the landmark sequences after MDPP(D, P) smoothing.
$(s_1, s_2) \in LSM$ if and only if |L1| = |L2| and there exist two parameterized transformations T1 and T2 of T whose dissimilarity satisfies
$\delta^{time}(T_1(L_1), T_2(L_2)) < \epsilon^{time}$ and
$\delta^{amp}(T_1(L_1), T_2(L_2)) < \epsilon^{amp}$.
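The dissimilarity check can be sketched as below under several assumptions that the slides do not confirm. The per-landmark distances are taken as relative differences (time gaps and amplitudes, each divided by the mean magnitude). They are aggregated over the sequence with a maximum. Landmark times are strictly increasing. T1 and T2 are both the identity, so no search over transformations is performed. All function and parameter names are illustrative.

```python
def delta_time(L1, L2, k):
    """Relative difference of the time gaps ending at landmark k (0-based)."""
    if k == 0:
        return 0.0                           # no previous landmark
    h1 = L1[k][0] - L1[k - 1][0]
    h2 = L2[k][0] - L2[k - 1][0]
    return abs(h1 - h2) / ((abs(h1) + abs(h2)) / 2)


def delta_amp(L1, L2, k):
    """Relative difference of the amplitudes at landmark k (0-based)."""
    y1, y2 = L1[k][1], L2[k][1]
    if y1 == y2:
        return 0.0
    return abs(y1 - y2) / ((abs(y1) + abs(y2)) / 2)


def distance(L1, L2):
    """(time, amp) distance; aggregating by max is an assumption."""
    if len(L1) != len(L2):
        raise ValueError("landmark sequences must have equal length")
    if not L1:
        return (0.0, 0.0)
    ks = range(len(L1))
    return (
        max(delta_time(L1, L2, k) for k in ks),
        max(delta_amp(L1, L2, k) for k in ks),
    )


def similar(L1, L2, eps_time, eps_amp):
    """True if both distance components fall below their thresholds."""
    if len(L1) != len(L2):
        return False
    d_time, d_amp = distance(L1, L2)
    return d_time < eps_time and d_amp < eps_amp


A = [(0, 1.0), (3, 4.0), (4, 3.0)]
B = [(0, 1.0), (3, 4.2), (4, 3.0)]           # same timing, slightly higher peak
```

A fuller implementation would first apply MDPP(D, P) to both series and then search the allowed transformation family T for a pair T1, T2 that brings the distance below the thresholds.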
Data Representation
Family of Time Series Segments
Equivalent under the six transformations
Replacing naïve landmark coordinates with various features of landmarks that are invariant under these transformations
F = {y, h, v, hr, vr, vhr, pv}
$h_i = x_i - x_{i-1}$, $v_i = y_i - y_{i-1}$, $hr_i = h_{i+1}/h_i$, $vr_i = v_{i+1}/v_i$, $vhr_i = v_i/h_i$, $pv_i = v_i/y_i$
Invariant features under transformations
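The features can be computed directly from a landmark sequence, as in this sketch. It assumes (x, y) tuples with strictly increasing x, nonzero y, and nonzero consecutive differences, so that no division by zero occurs. The function name and the dictionary layout are illustrative. The example then checks two invariances that follow from the definitions: hr is unchanged when all times are multiplied by a constant, and vr is unchanged when amplitudes are scaled and shifted.

```python
def landmark_features(landmarks):
    """Compute h, v, hr, vr, vhr, pv for a list of (x, y) landmarks."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    n = len(landmarks)
    h = [xs[i] - xs[i - 1] for i in range(1, n)]           # h_i = x_i - x_{i-1}
    v = [ys[i] - ys[i - 1] for i in range(1, n)]           # v_i = y_i - y_{i-1}
    hr = [h[j + 1] / h[j] for j in range(len(h) - 1)]      # hr_i = h_{i+1} / h_i
    vr = [v[j + 1] / v[j] for j in range(len(v) - 1)]      # vr_i = v_{i+1} / v_i
    vhr = [v[j] / h[j] for j in range(len(h))]             # vhr_i = v_i / h_i
    pv = [v[j] / ys[j + 1] for j in range(len(v))]         # pv_i = v_i / y_i
    return {"y": ys, "h": h, "v": v, "hr": hr, "vr": vr, "vhr": vhr, "pv": pv}


A = [(0, 1.0), (3, 4.0), (4, 3.0), (6, 5.0)]
# Stretch time by 2, scale amplitude by 3, and shift amplitude by 1.
B = [(2 * x, 3 * y + 1) for x, y in A]

features_a = landmark_features(A)
features_b = landmark_features(B)
```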
Conclusion
Landmark Model
An instance selection system for time series
This integrates similarity measures, data representation and
smoothing techniques in a single framework.
Minimal Distance/Percentage Principle (MDPP): the smoothing method for the Landmark Model
This also supports a generalized similarity model which can
ignore differences corresponding to six transformations.
Intuitive to humans