High Performance Algorithms for
Multiple Streaming Time Series
Xiaojian Zhao
Advisor: Dennis Shasha
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
Jan. 10 2006
1
Roadmap:
- Motivation
- Incremental Uncooperative Time Series Correlation
- Incremental Matching Pursuit (MP) (optional)
- Future Work and Conclusion
2
Motivation (1)
- Financial time series streams are watched closely by millions of traders.
- "Which pairs of stocks were correlated with a value of over 0.9 for the last three hours? Report this information every half hour." (Incremental pairwise correlation)
- "How to form a portfolio consisting of a small set of stocks which replicates the market? Update it every hour." (Incremental matching pursuit)
3
Motivation (2)
- As processors speed up, algorithmic efficiency no longer matters ... one might think.
- True if problem sizes stay the same. But they don't.
- As processors speed up, sensors improve:
  - Satellites spew out more data each day
  - Magnetic resonance imagers give higher-resolution images, etc.
4
High performance incremental algorithms
- Incremental Uncooperative Time Series Correlation
  - Monitor and report the correlation information among all time series incrementally (e.g. every half hour)
  - Improve the efficiency from quadratic to super-linear
- Incremental Matching Pursuit (MP)
  - Monitor and report the approximation vectors of matching pursuit incrementally (e.g. every hour)
  - Improve the efficiency significantly
5
Incremental Uncooperative Time Series Correlation
6
Problem statement:
- Detect and report the correlation incrementally and rapidly
- Extend the algorithm into a general engine
- Apply it in practical application domains
7
Online detection of high correlation
[Figure: two pairs of streaming time series flagged as "Correlated!"]
8
Pearson correlation and Euclidean distance
- Normalized Euclidean distance <=> Pearson correlation
- Normalization: X_i <- (X_i - avg(X_sw)) / sqrt(var(X_sw))
- dist^2 = 2 (1 - correlation)
- From now on, we will not differentiate between correlation and Euclidean distance
9
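As a small illustration of the identity above, here is a minimal sketch in Python (NumPy assumed). It uses the zero-mean, unit-L2-norm convention for normalization, which is one convention under which dist^2 = 2(1 - correlation) holds exactly; the variable names are illustrative, not from the talk.

```python
import numpy as np

def normalize(x):
    """Zero-mean, unit-L2-norm normalization of one window
    (the convention under which dist^2 = 2 * (1 - correlation))."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x / np.linalg.norm(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(180).cumsum()    # a toy "price" window
y = x + rng.standard_normal(180) * 2.0   # a correlated window

xn, yn = normalize(x), normalize(y)
corr = np.corrcoef(x, y)[0, 1]
dist2 = np.sum((xn - yn) ** 2)

print(corr, dist2, 2 * (1 - corr))       # the last two values agree
```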
Naïve approach: pairwise correlation
- Given a group of time series, compute the pairwise correlation
- Time O(W * N^2), where
  - N: number of streams
  - W: window size (e.g. the past one hour)
- Let's see high performance algorithms!
10
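For reference, a minimal sketch of this O(W * N^2) baseline over the most recent window; the function and array names here are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def naive_correlated_pairs(data, threshold=0.9):
    """Brute force: correlate every pair over the current window.
    data: array of shape (N streams, W window points).  Cost O(W * N^2)."""
    n = data.shape[0]
    reported = []
    for i, j in combinations(range(n), 2):
        corr = np.corrcoef(data[i], data[j])[0, 1]
        if corr > threshold:
            reported.append((i, j, corr))
    return reported

# usage: rerun on every reporting period (e.g. every half hour)
rng = np.random.default_rng(1)
window = rng.standard_normal((50, 360)).cumsum(axis=1)
print(len(naive_correlated_pairs(window)))
```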
Technical review
- Framework: GEMINI
- Tools: data reduction techniques
  - Deterministic orthogonal vs. randomized
  - Fourier Transform, Wavelet Transform, and Random Projection
- Target: various data
  - Cooperative vs. uncooperative
11
GEMINI Framework*
Data reduction, e.g. DFT, DWT, SVD
* Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. SIGMOD 1994.
12
GEMINI: an example
- Objective: find the nearest neighbor (L2-norm) of each time series.
- Compute the Fourier Transform of each of them, e.g. X and Y; this yields two coefficient vectors Xf = (a1, a2, ..., ak) and Yf = (b1, b2, ..., bk).
- Original distance vs. coefficient distance (Parseval's Theorem):
  d = ||x - y||
    = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xk - yk)^2)
    = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (ak - bk)^2)
    >= sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (ah - bh)^2), where h <= k
- Because, for some data types, energy concentrates in the first few frequency components, the coefficient distance can work as a very good filter and at the same time guarantee no false negatives.
- The coefficient vectors may be stored in a tree or grid structure.
13
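A hedged sketch of this GEMINI-style filter: the distance over the first h Fourier coefficients lower-bounds the true distance (Parseval), so it can prune pairs with no false negatives. The 1/sqrt(w) scaling below is one way to make NumPy's FFT satisfy Parseval's identity; it is an assumption of this sketch, not the original authors' code.

```python
import numpy as np

def fourier_lower_bound(x, y, h=8):
    """Distance computed on the first h DFT coefficients.
    By Parseval's theorem this never exceeds the true L2 distance,
    so it can be used as a pruning filter with no false negatives."""
    w = len(x)
    xf = np.fft.fft(x) / np.sqrt(w)   # scaling makes ||x|| equal ||xf||
    yf = np.fft.fft(y) / np.sqrt(w)
    return np.sqrt(np.sum(np.abs(xf[:h] - yf[:h]) ** 2))

rng = np.random.default_rng(2)
x = rng.standard_normal(128).cumsum()   # random walk: "cooperative" data
y = x + rng.standard_normal(128) * 0.5

true_d = np.linalg.norm(x - y)
approx_d = fourier_lower_bound(x, y, h=8)
assert approx_d <= true_d + 1e-9        # the filter is a lower bound
print(approx_d, true_d)
```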
DFT on random walk
[Figure: ratio of captured energy over total energy vs. the number of DFT coefficients for a random-walk series; the ratio climbs toward 1 within the first few coefficients.]
d~ = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (ah - bh)^2)
14
Review: DFT/DWT vs. Random Projection
- Fourier Transform, Wavelet Transform and SVD
  - A set of orthogonal bases (deterministic)
  - Based on Parseval's Theorem
- Random Projection
  - A set of random bases (non-deterministic)
  - Based on the Johnson-Lindenstrauss (JL) Lemma
[Figure: an orthogonal basis vs. a random basis]
15
Review: Random Projection: Intuition
- You are walking in a sparse forest and you are lost.
- You have an outdated cell phone without GPS (no latitude/longitude).
- You want to know whether you are close to your friend.
- You identify yourself as 100 meters from Best Buy, 200 meters from a silver building, etc.
- If your friend is at similar distances from several of these landmarks, you might be close to one another.
- Random projections are analogous to these distances to landmarks.
16
Random Projection
Time series:
  x = (x1, x2, x3, ..., xw)
  y = (y1, y2, y3, ..., yw)
Random vectors:
  R1 = (r11, r12, r13, ..., r1w)
  R2 = (r21, r22, r23, ..., r2w)
  R3 = (r31, r32, r33, ..., r3w)
  R4 = (r41, r42, r43, ..., r4w)
Inner products give the sketches:
  xsk = (xsk1, xsk2, xsk3, xsk4)
  ysk = (ysk1, ysk2, ysk3, ysk4)
Sketch: the vector of outputs returned by random projection
17
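A small sketch of plain (unstructured) random projection, assuming +/-1 random vectors as on the later slides; the /sqrt(k) scaling matches the sketch-distance formula shown on slide 19. Names are illustrative.

```python
import numpy as np

def make_sketch(x, R):
    """Sketch = vector of inner products of x with k random vectors.
    R has shape (k, w); entries are +1/-1 chosen uniformly."""
    return R @ x

def sketch_distance(xsk, ysk):
    """Estimate of ||x - y|| from the sketches."""
    return np.linalg.norm(xsk - ysk) / np.sqrt(len(xsk))

rng = np.random.default_rng(3)
w, k = 256, 30
R = rng.choice([-1.0, 1.0], size=(k, w))

x = rng.standard_normal(w)
y = x + 0.3 * rng.standard_normal(w)

xsk, ysk = make_sketch(x, R), make_sketch(y, R)
print(sketch_distance(xsk, ysk), np.linalg.norm(x - y))  # roughly comparable
```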
Review: Sketch Guarantees*
Johnson-Lindenstrauss (JL) Lemma:
- For any 0 < epsilon < 1 and any integer n, let k be a positive integer such that
    k >= 4 (epsilon^2/2 - epsilon^3/3)^(-1) ln n.
  Then for any set V of n points in R^d, there is a map f: R^d -> R^k such that, for all u, v in V,
    (1 - epsilon) ||u - v||^2 <= ||f(u) - f(v)||^2 <= (1 + epsilon) ||u - v||^2.
- Further, this map can be found in randomized polynomial time.
* W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26:189-206, 1984.
18
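For concreteness, a one-liner computing the k prescribed by the lemma for a given epsilon and n. This is only the worst-case bound; as the following slides show, much smaller sketches already work well in practice.

```python
import math

def jl_dimension(eps, n):
    """Smallest integer k with k >= 4 * (eps^2/2 - eps^3/3)^(-1) * ln(n)."""
    return math.ceil(4.0 / (eps**2 / 2.0 - eps**3 / 3.0) * math.log(n))

print(jl_dimension(0.5, 10_000))   # a bound of a few hundred dimensions
```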
Empirical study: sketch approximation
[Figure: sketch distance vs. real distance over roughly 1000 return-data points; the two curves track each other closely.]
dsk = sqrt(((xsk1 - ysk1)^2 + (xsk2 - ysk2)^2 + ... + (xskk - yskk)^2) / k)
Time series length = 256 and sketch size = 30
19
Empirical study: sketch approximation
[Figure: scatter plot of real distance vs. sketch distance, both axes ranging roughly from 10 to 28; the points lie close to the diagonal.]
20
Empirical study: sketch distance / real distance
[Figure: three histograms of the factor (real distance / sketch distance) for sketch sizes 30, 80, and 1000; the factor concentrates around 1, and the distribution tightens as the sketch size grows.]
21
Data classification
- Cooperative
  - Time series exhibiting a fundamental degree of regularity, allowing them to be represented by the first few coefficients in the spectral space with little loss of information
  - Example: stock price (random walk)
  - Tools: Fourier Transform, Wavelet Transform, SVD
- Uncooperative
  - Time series whose energy is not concentrated in only a few frequency components
  - Example: stock return (= (today's price - yesterday's price) / yesterday's price)
  - Tool: Random Projection
22
DFT on random walk and white noise
[Figure: ratio of captured energy over total energy vs. the number of DFT coefficients. The random walk (cooperative) captures most of the energy within the first few coefficients; white noise (uncooperative) gains energy only gradually as coefficients are added.]
23
Approximation Power: SVD
[Figure: real distance vs. sketch distance vs. SVD distance on return data over roughly 900 data points.]
- Note: SVD is superior to DFT and DWT in approximation power.
- But all of them are bad for uncooperative data.
- Here sketch size = 32 and the number of SVD coefficients = 30.
24
Our new algorithm*
- The big picture of the system
- Structured random vector (new)
- Compute sketches by structured convolution (new)
- Optimize in the parameter space (new)
- Empirical study
* Richard Cole, Dennis Shasha and Xiaojian Zhao. "Fast Window Correlations Over Uncooperative Time Series". SIGKDD 2005.
25
Big Picture
[Diagram: time series 1..n -> Random Projection (data reduction) -> sketches 1..n -> grid structure (filtering) -> correlated pairs]
26
Our objective, restated
- Monitor and report the correlation periodically, e.g. "every half hour"
- We chose Random Projection as the means to reduce the data dimension
- Each time series is examined within a time window, and this time window slides forward as time goes on.
27
Definitions: sliding window and basic window
[Diagram: stocks 1..n along the time axis; each sliding window (sw, size 8 here) is divided into basic windows (bw, size 2 here) of consecutive time points.]
Example: every half hour (bw), report the correlation of the last three hours (sw)
28
Random vector and naïve random projection
- Choose sw random numbers to form a random vector R = (r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12)
- The inner product starts from each data point:
  xsk1 = (x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12) * R
  xsk2 = (x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13) * R
  xsk3 = (x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14) * R
  ...
- We improve on this in two ways:
  - Partition the random vector of length sw into several basic windows
  - Use convolution instead of inner products
29
How to construct a random vector
- Construct a random vector of +1/-1 entries of length sw.
- Suppose the sliding window size = 12 and the basic window size = 4.
- The random vector within a basic window is Rbw = (r1, r2, r3, r4), where each ri = +1 or -1 with probability 1/2.
- A control vector b = (b1, b2, b3), where each bi = +1 or -1 with probability 1/2.
- A final complete random vector for a sliding window may look like
  (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1) = (Rbw; -Rbw; Rbw),
  here with Rbw = (1 1 -1 1) and b = (1 -1 1).
30
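A minimal sketch of building the structured random vector from a per-basic-window vector Rbw and a control vector b, as described above; the function name and layout are illustrative assumptions.

```python
import numpy as np

def structured_random_vector(sw, bw, rng):
    """Random +/-1 vector of length sw built from one basic-window
    pattern Rbw and a control vector b with one sign per basic window."""
    assert sw % bw == 0
    nb = sw // bw
    Rbw = rng.choice([-1.0, 1.0], size=bw)   # pattern inside a basic window
    b = rng.choice([-1.0, 1.0], size=nb)     # sign of each basic window
    r = np.concatenate([bi * Rbw for bi in b])
    return r, Rbw, b

rng = np.random.default_rng(4)
r, Rbw, b = structured_random_vector(sw=12, bw=4, rng=rng)
print(Rbw, b)
print(r)   # e.g. (Rbw; -Rbw; Rbw) when b = (1, -1, 1)
```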
Naive algorithm and hope for improvement
  r = ( 1 1 -1 1; -1 -1 1 -1; 1 1 -1 1 )
  x = (x1 x2 x3 x4; x5 x6 x7 x8; x9 x10 x11 x12)
Dot product:
  xsk = r * x = x1+x2-x3+x4-x5-x6+x7-x8+x9+x10-x11+x12
With the arrival of new data points this operation is done again:
  r  = ( 1 1 -1 1; -1 -1 1 -1; 1 1 -1 1 )
  x' = (x5 x6 x7 x8; x9 x10 x11 x12; x13 x14 x15 x16)
  xsk' = r * x' = x5+x6-x7+x8-x9-x10+x11-x12+x13+x14-x15+x16
- There is redundancy in the second dot product given the first one.
- We will eliminate the repeated computation to save time.
31
Our algorithm
- All the operations are over the basic window.
- Pad Rbw with bw - 1 zeros, then convolve it with each basic window Xbw:
  conv1: (1 1 -1 1 0 0 0) convolved with (x1, x2, x3, x4)
  conv2: (1 1 -1 1 0 0 0) convolved with (x5, x6, x7, x8)
  conv3: (1 1 -1 1 0 0 0) convolved with (x9, x10, x11, x12)
[Animation: the filter slides across (x1, x2, x3, x4), producing the partial sums
  x4, x4+x3, -x4+x3+x2, x4-x3+x2+x1, x3-x2+x1, x2-x1, x1]
32
Our algorithm: example
  First convolution    Second convolution    Third convolution
  x4                   x8                    x12
  x4+x3                x8+x7                 x12+x11
  x2+x3-x4             x6+x7-x8              x10+x11-x12
  x1+x2-x3+x4          x5+x6-x7+x8           x9+x10-x11+x12
  x1-x2+x3             x5-x6+x7              x9-x10+x11
  x2-x1                x6-x5                 x10-x9
  x1                   x5                    x9
xsk1 = (x1+x2-x3+x4) - (x5+x6-x7+x8) + (x9+x10-x11+x12)
xsk2 = (x2+x3-x4+x5) - (x6+x7-x8+x9) + (x10+x11-x12+x13)
33
Our algorithm: example
First sliding window, with b = (1 -1 1):
  sk1 = (x1+x2-x3+x4)
  sk5 = (x5+x6-x7+x8)
  sk9 = (x9+x10-x11+x12)
  xsk1 = (x1+x2-x3+x4) - (x5+x6-x7+x8) + (x9+x10-x11+x12)
Second sliding window, with b = (1 -1 1):
  sk2  = (x2+x3-x4) + (x5)
  sk6  = (x6+x7-x8) + (x9)
  sk10 = (x10+x11-x12) + (x13)
  Then sum up and we have
  xsk2 = (x2+x3-x4+x5) - (x6+x7-x8+x9) + (x10+x11-x12+x13)
In general, the sketch is the inner product (sk1 sk5 sk9) * (b1 b2 b3), where * is the inner product (a code sketch follows this slide).
34
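A hedged sketch of the pointwise structured-convolution scheme above: one short convolution per basic window, followed by the b-weighted combination, checked against the naive dot product with the full structured random vector. It assumes the basic window size divides the stream length; all names are illustrative, not the authors' code.

```python
import numpy as np

def structured_sketches(x, Rbw, b):
    """Sketches of every sliding-window alignment of stream x.
    Rbw : +/-1 vector for one basic window (length bw)
    b   : +/-1 control vector, one entry per basic window (length nb)
    Sliding window length sw = nb * bw."""
    bw, nb = len(Rbw), len(b)
    sw = bw * nb
    nblocks = len(x) // bw
    blocks = x[:nblocks * bw].reshape(nblocks, bw)

    # One short convolution per basic window (an FFT could be used for large bw).
    convs = [np.convolve(block, Rbw[::-1]) for block in blocks]   # each of length 2*bw-1

    # s[u] = sliding dot product of Rbw with x starting at position u,
    # assembled from the tail of one block's convolution and the head of the next.
    n_s = (nblocks - 1) * bw + 1
    s = np.empty(n_s)
    for u in range(n_s):
        j, d = divmod(u, bw)
        s[u] = convs[j][bw - 1 + d]
        if d > 0:
            s[u] += convs[j + 1][d - 1]

    # Sketch of the sliding window starting at t: b-weighted sum of the
    # per-basic-window sliding dot products.
    n_t = min(len(x) - sw + 1, n_s - (nb - 1) * bw)
    return np.array([sum(b[j] * s[t + j * bw] for j in range(nb))
                     for t in range(n_t)])

# Check against the naive dot product with the full structured random vector.
rng = np.random.default_rng(5)
x = rng.standard_normal(64)
Rbw = rng.choice([-1.0, 1.0], size=4)
b = rng.choice([-1.0, 1.0], size=3)
r = np.concatenate([bi * Rbw for bi in b])           # full random vector, length 12

fast = structured_sketches(x, Rbw, b)
naive = np.array([np.dot(x[t:t + 12], r) for t in range(len(fast))])
print(np.allclose(fast, naive))                      # True
```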
Basic window version
- Or, if the time series are highly correlated between two consecutive data points, we may compute the sketch only once per basic window.
- That is, we update the sketch for each time series only when the data of a complete basic window arrive. No convolution, only inner products:
  x1 x2 x3 x4 | x5 x6 x7 x8 | x9 x10 x11 x12
   1  1 -1  1 |  1  1 -1  1 |  1   1  -1   1
  x1+x2-x3+x4 | x5+x6-x7+x8 | x9+x10-x11+x12
35
Overview of our new algorithm
- The projection of a sliding window is decomposed into operations over basic windows
- Each basic window is convolved (or inner-producted) with each random vector only once
- We may provide sketches starting from each data point, or starting from the beginning of each basic window
- There is no redundancy
36
Performance comparison
- Naïve algorithm: for each datum and random vector,
  (1) O(|sw|) integer additions
- Pointwise version: asymptotically, for each datum and random vector,
  (1) O(|sw|/|bw|) integer additions
  (2) O(log |bw|) floating point operations (using the FFT to compute convolutions)
- Basic window version: asymptotically, for each datum and random vector,
  O(|sw|/|bw|^2) integer additions
37
Big picture revisited
[Diagram: time series 1..n -> Random Projection -> sketches 1..n -> grid structure (filtering) -> correlated pairs]
So far we have reduced the data dimension efficiently. Next, how can it be used as a filter?
38
How to use the sketch distance as a filter
- Naive method: compute the sketch distance
    dsk = ||xsk - ysk||
    dsk = sqrt(((xsk1 - ysk1)^2 + (xsk2 - ysk2)^2 + ... + (xskk - yskk)^2) / k)
  and test whether dsk <= c * dist
- Pairs that are close by sketch distance are likely to be close by original distance (JL Lemma)
- Finally, any close data pair will be double-checked against the original data.
39
Use the sketch distance as a filter
- But we do not use it directly. Why? It is expensive:
- We would still have to do the pairwise comparison between each pair of stocks, which is O(n^2 k), where k is the size of the sketches, e.g. typically 30, 40, etc.
- Let's see our new strategy
40
Our method: sketch unit distance
Given sketches
  xsk = (xsk1, xsk2, xsk3, xsk4, xsk5, xsk6, xsk7, xsk8)
  ysk = (ysk1, ysk2, ysk3, ysk4, ysk5, ysk6, ysk7, ysk8)
we have the distance chunks
  |xsk1 - ysk1|, |xsk2 - ysk2|, |xsk3 - ysk3|, |xsk4 - ysk4|,
  |xsk5 - ysk5|, |xsk6 - ysk6|, |xsk7 - ysk7|, |xsk8 - ysk8|
If at least a fraction f of the distance chunks have |xski - yski| <= c * dist, we may say ||x - y|| <= dist,
where f: 30%, 40%, 50%, 60%, ... and c: 0.8, 0.9, 1.1, ...
41
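A minimal sketch of this per-coordinate filter: a pair passes if at least a fraction f of its sketch coordinates differ by no more than c * dist. The parameters f and c are the ones tuned later; names are illustrative.

```python
import numpy as np

def passes_unit_filter(xsk, ysk, dist, c=0.9, f=0.5):
    """Declare the pair "possibly within dist" if at least a fraction f
    of the sketch coordinates satisfy |xsk_i - ysk_i| <= c * dist."""
    close = np.abs(xsk - ysk) <= c * dist
    return close.mean() >= f

rng = np.random.default_rng(6)
xsk = rng.standard_normal(60)
ysk = xsk + 0.05 * rng.standard_normal(60)
print(passes_unit_filter(xsk, ysk, dist=1.0))   # likely True for a close pair
```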
Further: sketch groups
  xsk = (xsk1, xsk2, xsk3, xsk4, xsk5, xsk6, xsk7, xsk8, ...)
  ysk = (ysk1, ysk2, ysk3, ysk4, ysk5, ysk6, ysk7, ysk8, ...)
We may compute the sketch group distances dskg1, dskg2, dskg3, ..., where
  dskgi = ||xsk_gi - ysk_gi||
(This is reminiscent of a grid structure.)
For example,
  dskg1 = sqrt(((xsk1 - ysk1)^2 + (xsk2 - ysk2)^2 + (xsk3 - ysk3)^2 + (xsk4 - ysk4)^2) / 4)
If at least a fraction f of the sketch groups have dskgi <= c * dist, we may say ||x - y|| <= dist.
42
Grid structure
- To avoid checking all pairs, we can use a grid structure and look in the neighborhood; this returns a superset of the highly correlated pairs (a code sketch follows this slide).
- The pairs labeled as "potential" are double-checked using the raw data vectors x = (x1, x2, ..., xk).
43
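A rough sketch of how sketch groups can be bucketed into a grid: each group of g coordinates indexes a cell of side c * dist, and a pair becomes a candidate when it lands in the same cell in enough groups. Everything here is an illustrative assumption (a full implementation would also look in neighboring cells, as the slide says); candidates are then verified on the raw data.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def candidate_pairs(sketches, dist, c=0.9, f=0.5, g=2):
    """Bucket each group of g sketch coordinates into grid cells of side
    c*dist; a pair becomes a candidate if it shares a cell in at least a
    fraction f of the groups.  (Neighboring-cell lookups, which avoid
    boundary effects, are omitted here for brevity.)
    sketches: array of shape (n series, k sketch values)."""
    n, k = sketches.shape
    n_groups = k // g
    votes = defaultdict(int)
    for gi in range(n_groups):
        cells = defaultdict(list)
        block = sketches[:, gi * g:(gi + 1) * g]
        for series, cell in enumerate(map(tuple, np.floor(block / (c * dist)).astype(int))):
            cells[cell].append(series)
        for members in cells.values():
            for pair in combinations(members, 2):
                votes[pair] += 1
    return [pair for pair, v in votes.items() if v >= f * n_groups]

rng = np.random.default_rng(7)
sk = rng.standard_normal((100, 60))
sk[1] = sk[0] + 0.01                     # plant one near-duplicate pair
print(candidate_pairs(sk, dist=0.5))     # a superset of candidates, e.g. [(0, 1)];
                                         # each candidate is then verified on the raw data
```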
Optimization in parameter space
- How do we choose the parameters N, g, c, f?
  - N: total number of sketches
  - g: group size
  - c: the distance factor
  - f: the fraction of groups needed to claim that two time series are close enough
- We will choose the best combination to apply to the practical data. But how? It is an engineering problem:
  - Combinatorial design (CD)
  - Bootstrapping (a rough sketch follows this slide)
- Now, let's put it all together.
44
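As a rough illustration of the bootstrapping step only (the combinatorial-design part, which limits how many parameter combinations are tried, is not shown), one might estimate on resampled pairs how often a given (c, f) setting keeps truly close pairs and how many pairs it lets through. Everything below is an illustrative assumption, not the authors' exact procedure.

```python
import numpy as np

def bootstrap_score(raw, sketches, dist, c, f, n_samples=2000, seed=0):
    """On resampled pairs, estimate the recall of the sketch filter
    (fraction of truly close pairs it accepts) and the fraction of all
    pairs it lets through (smaller means less verification work)."""
    rng = np.random.default_rng(seed)
    n = raw.shape[0]
    close_total = close_accepted = accepted_total = 0
    for _ in range(n_samples):
        i, j = rng.integers(0, n, size=2)
        accepted = np.mean(np.abs(sketches[i] - sketches[j]) <= c * dist) >= f
        accepted_total += accepted
        if np.linalg.norm(raw[i] - raw[j]) <= dist:
            close_total += 1
            close_accepted += accepted
    recall = close_accepted / close_total if close_total else float("nan")
    return recall, accepted_total / n_samples

rng = np.random.default_rng(8)
raw = rng.standard_normal((200, 256))               # a training slice of the streams
R = rng.choice([-1.0, 1.0], size=(60, 256))
sketches = raw @ R.T                                # plain +/-1 sketches
for c, f in [(0.8, 0.5), (0.9, 0.5), (1.1, 0.3)]:
    print((c, f), bootstrap_score(raw, sketches, dist=21.0, c=c, f=f))
```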
[Diagram: three time series X, Y, Z; inner products with random vectors r1, r2, r3, r4, r5, r6 give the sketch vectors
  xsk = (xsk1, xsk2, xsk3, xsk4, xsk5, xsk6)
  ysk = (ysk1, ysk2, ysk3, ysk4, ysk5, ysk6)
  zsk = (zsk1, zsk2, zsk3, zsk4, zsk5, zsk6)]
45
Grid structure
[Diagram: the sketch groups (xsk1, xsk2), (xsk3, xsk4), (xsk5, xsk6), and likewise for Y and Z, are hashed into cells of a grid structure.]
46
Empirical study: various data sources
- Cstr: continuous stirred tank reactor
- Foetal_ecg: cutaneous potential recordings of a pregnant woman
- Steamgen: model of a steam generator at Abbott Power Plant in Champaign, IL
- Winding: data from a test setup of an industrial winding process
- Evaporator: data from an industrial evaporator
- Wind: daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland
- Spot_exrates: spot foreign currency exchange rates
- EEG: electroencephalogram
47
Empirical study: performance comparison
[Figure: normalized processing time of the sketch method, DFT, and a linear scan on the practical data sets (price, return, evaporator, spot_exrates, wind, cstr, eeg, foetal_ecg, steamgen, winding).]
Sliding window = 3616, basic window = 32 and sketch size = 60
48
Section conclusion
- How to perform data reduction over uncooperative time series efficiently, in contrast to well-established methods for cooperative data
- How to cope with medium-size sketch vectors systematically:
  - Sketch vector partition, grid structure
  - Parameter space optimization by combinatorial design and bootstrapping
- Many of these ideas can be extended to other applications
49
Incremental Matching Pursuit (MP)
50
Problem Statement:
- Imagine a scenario where a group of representative stocks is chosen to form an index, e.g. the Standard and Poor's (S&P) 500.
- Target vector: the summation of all the stock vectors, weighted by their capitalization
- Candidate pool: all the stock price vectors in the market
- Objective: find from the candidate pool a small group of vectors representing the target vector
51
Vanilla Matching Pursuit (MP)
- Greedily select a linear combination of vectors from a dictionary to approximate a target vector:
1. Set i = 1.
2. Search the pool V and find the vector vi whose angle cosine cos(theta) with respect to the target vector vt is maximal.
3. Compute the residue r = vt - ci*vi, where ci = (||vt|| / ||vi||) cos(theta); set VA = VA U {vi}.
4. If ||r|| < error tolerance, terminate and return VA.
5. Else set i = i + 1 and vt = r; go back to step 2.
52
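A compact sketch of the greedy loop above (NumPy, names illustrative): each round picks the dictionary vector with the largest |cos(theta)| against the current residue and subtracts its projection.

```python
import numpy as np

def matching_pursuit(vt, pool, tol=1e-2, max_iter=50):
    """Greedy matching pursuit: approximate target vt by vectors from pool.
    pool: array of shape (n vectors, dim).  Returns chosen indices and weights."""
    residue = vt.astype(float).copy()
    norms = np.linalg.norm(pool, axis=1)
    chosen, weights = [], []
    for _ in range(max_iter):
        # cos(theta) between the current residue and every candidate vector
        cos = (pool @ residue) / (norms * np.linalg.norm(residue))
        i = int(np.argmax(np.abs(cos)))
        ci = np.linalg.norm(residue) / norms[i] * cos[i]   # c_i = (||r|| / ||v_i||) cos(theta)
        residue = residue - ci * pool[i]
        chosen.append(i)
        weights.append(ci)
        if np.linalg.norm(residue) < tol * np.linalg.norm(vt):
            break
    return chosen, weights

rng = np.random.default_rng(9)
pool = rng.standard_normal((500, 128))                   # candidate stock vectors
vt = pool[[3, 42, 7]].T @ np.array([0.5, 0.3, 0.2])      # a synthetic "index"
print(matching_pursuit(vt, pool)[0][:5])                 # the true components are found early
```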
[Figure: the target vector vt being approximated step by step by dictionary vectors v1, v2, v3.]
53
The incremental setting*
- Time granularity revisited:
  - Basic window = a sequence of unit time points
  - Sliding window = several consecutive basic windows
  - The sliding window "slides" once per basic window
- Recomputing the representative vectors entirely for each sliding window is wasteful, since there may be a trend between consecutive sliding windows
* Xiaojian Zhao, Xin Zhang, Tyler Neylon and Dennis Shasha. "Incremental Methods for Simple Problems in Time Series: algorithms and experiments", IDEAS 2005.
54
First idea: reuse vectors
- The representative vectors may change only slightly, in both their components and their order
  - True only if the basic window is sufficiently small, e.g. 2 or 3 time points
  - However, any newly introduced representative vector may alter the entire tail of the approximation path
  - The relative importance of the same representative vector may differ a lot from one sliding window to the next
55
Two insightful observations
- The representative vectors are likely to remain the same within a few sliding windows, though their order may change
- The vector of angle cosines stays quite consistent, i.e. (cos(theta1), cos(theta2), cos(theta3), ...)
  - Here cos(thetai) is the cosine of the angle between the ith residue and the vector selected at that round
  - An example is (0.9, 0.8, 0.7, 0.7, 0.6, 0.6, 0.6, ...)
56
Angle space exploration (cos(theta))
- Whenever a vector is found whose |cos(theta)| is larger than some threshold, choose that vector.
- If there is no such vector, the vector with the largest |cos(theta)| is selected as the representative vector for this round.
57
Second idea: cache good vectors
- The representative vectors appearing in the last several sliding windows form a cache C
- The search for a representative vector starts from C; if none is found there, go to the whole pool V (a code sketch follows this slide)
- This works well in practice.
58
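A rough sketch combining the two ideas, in the spirit of the `matching_pursuit` routine above: search the cache first, accept any cached vector whose |cos(theta)| reaches a threshold derived from the previous window's cos(theta) profile, and fall back to the full pool otherwise. Purely illustrative; parameter names are assumptions.

```python
import numpy as np

def incremental_mp(vt, pool, cache, cos_profile, slack=0.9, tol=1e-2, max_iter=50):
    """One sliding-window update of incremental matching pursuit.
    cache: indices of vectors chosen in recent sliding windows.
    cos_profile: |cos(theta)| observed at each round in the previous window;
    a cached vector is accepted at round i if it reaches slack * cos_profile[i]."""
    residue = vt.astype(float).copy()
    norms = np.linalg.norm(pool, axis=1)
    chosen = []
    for it in range(max_iter):
        level = cos_profile[min(it, len(cos_profile) - 1)] if len(cos_profile) else 1.0
        target = slack * level
        cand = None
        # 1) try the cache first
        if cache:
            c_idx = np.array(sorted(cache))
            cos_c = (pool[c_idx] @ residue) / (norms[c_idx] * np.linalg.norm(residue))
            best = int(np.argmax(np.abs(cos_c)))
            if abs(cos_c[best]) >= target:
                cand, cos_v = int(c_idx[best]), cos_c[best]
        # 2) otherwise search the whole pool
        if cand is None:
            cos_all = (pool @ residue) / (norms * np.linalg.norm(residue))
            cand = int(np.argmax(np.abs(cos_all)))
            cos_v = cos_all[cand]
        ci = np.linalg.norm(residue) / norms[cand] * cos_v
        residue = residue - ci * pool[cand]
        chosen.append(cand)
        if np.linalg.norm(residue) < tol * np.linalg.norm(vt):
            break
    return chosen
```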
Empirical study: time comparison
[Figure: running-time ratio of Incremental MP to Naive MP for different settings of basic window size (2, 10, 50, 100) and power ratio (5%, 10%, 20%); x-axis labels of the form bw/power ratio, y-axis time ratio from 0 to 1.2.]
59
Empirical study: approximation power comparison
[Figure: ratio of the number of vectors selected by Incremental MP vs. Naive MP for the same bw/power-ratio settings; y-axis vector number ratio from 0 to 1.6.]
60
Future Work and Conclusion
61
Future work: Anomaly Detection
- Measure the relative distance of each point from its nearest neighbors
- Our approach may serve as a monitor by reporting those points far from any normal points
62
Conclusion
1. Motivation
2. Introduce the concept of cooperative vs. uncooperative time series
3. Propose a set of strategies for dealing with different data (random projection, structured convolution, combinatorial design, bootstrapping, grid structure)
4. Explore various incremental schemes
   - Filter away obvious irrelevancies
   - Reuse previous results
5. Future work
63
Thanks a lot!
64