Fast Algorithms for Time Series with
applications to Finance, Physics, Music and
other Suspects
Dennis Shasha
Joint work with Yunyue Zhu, Xiaojian
Zhao, Zhihua Wang, and Alberto Lerner
{shasha,yunyue, xiaojian, zhihua,
lerner}@cs.nyu.edu
Courant Institute, New York University
Goal of this work
• Time series are important in so many
applications – biology, medicine, finance,
music, physics, …
• A few fundamental operations occur all the
time: burst detection, correlation, pattern
matching.
• Do them fast to make data exploration faster,
real time, and more fun.
Sample Needs
• Pairs Trading in Finance: find two stocks that
track one another closely. When they go out of
correlation, buy one and sell the other.
• Match a person’s humming against a database of
songs to help him/her buy a song.
• Find bursts of activity even when you don’t
know the window size over which to measure.
• Query and manipulate ordered data.
Why Speed Is Important
• Person on the street: “As processors speed up,
algorithmic efficiency no longer matters”
• True if problem sizes stay the same.
• They don’t. As processors speed up, sensors
improve – e.g. satellites spewing out a terabyte a
day, magnetic resonance imagers give higher
resolution images, etc.
• Desire for real time response to queries.
Surprise, surprise
• More data, real-time response,
increasing importance of
correlation
IMPLIES
Efficient algorithms and data
management more important than
ever!
Corollary
• Important area, lots of new problems.
• Small advertisement: High Performance
Discovery in Time Series (Springer 2004).
At this conference.
Outline
• Correlation across thousands of time series
• Query by humming: correlation + shifting
• Burst detection: when you don’t know window
size
• AQuery: a query language for time series.
Real-time Correlation Across
Thousands (and scaling) of Time
Series
Scalable Methods for Correlation
• Compress streaming data into moving synopses.
• Update the synopses in constant time.
• Compare synopses in near linear time with
respect to number of time series.
• Use transforms + simple data structures.
(Avoid curse of dimensionality.)
GEMINI framework*
* Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, Minneapolis, MN, May 25-27, pp. 419-429.
StatStream (VLDB 2002): Example
• Stock price streams
– The New York Stock Exchange (NYSE)
– 50,000 securities (streams); 100,000 ticks (trade and quote)
• Pairs Trading, a.k.a. Correlation Trading
• Query: “which pairs of stocks were correlated with a value of over 0.9 for the last three hours?”

“XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC…”
Online Detection of High Correlation
• Given tens of thousands of high-speed time series data streams, detect high-value correlation, both synchronized and time-lagged, over sliding windows in real time.
• Real time means:
– high update frequency of the data stream
– fixed response time, online

[Animation: two of the streams drift together and become correlated within the sliding window.]
StatStream: Naïve Approach
• Goal: find the most highly correlated stream pairs over sliding windows
– N: number of streams
– w: size of sliding window
– space O(N) and time O(N²w)
• Suppose that the streams are updated every second.
– With a Pentium 4 PC, the exact computing method can monitor only 700 streams, with each result produced after a separation of two minutes.
– Note: “punctuated result model” – not continuous, but online.
StatStream: Our Approach
– Use the Discrete Fourier Transform to approximate correlation, as in the GEMINI approach.
– Every two minutes (“basic window size”), update the DFT for each time series over the last hour (“window size”).
– Use a grid structure to filter out unlikely pairs.
– Our approach can report highly correlated pairs among 10,000 streams for the last hour with a delay of 2 minutes. So, at 2:02, find highly correlated pairs between 1 PM and 2 PM; at 2:04, find highly correlated pairs between 1:02 PM and 2:02 PM, etc.
StatStream: Stream synoptic data structure
• Three-level time interval hierarchy
– Time point, basic window, sliding window
• Basic window (the key to our technique)
– The computation for basic window i must finish by the end of basic window i+1.
– The basic window time is the system response time.
• Digests
– Each basic window keeps a digest: its sum and its first few DFT coefficients.
– The sliding window keeps a combined digest (sum and DFT coefficients) assembled from the basic windows it spans.

[Figure: a sliding window divided into basic windows, each made of time points; every basic window carries a digest (sum, DFT coefs), and the sliding window carries a combined digest built from them.]
How the general technique is applied
• Compress streaming data into moving synopses:
Discrete Fourier Transform.
• Update the synopses in time proportional to
number of coefficients: basic window idea.
• Compare synopses in real time: compare DFTs.
• Use transforms + simple data structures (grid
structure).
Synchronized Correlation Uses Basic Windows

$$\operatorname{corr}(s,r)=\frac{\tfrac{1}{w}\sum_{i=1}^{w}s_i r_i-\bar{s}\,\bar{r}}{\sqrt{\tfrac{1}{w}\sum_{i=1}^{w}(s_i-\bar{s})^2}\;\sqrt{\tfrac{1}{w}\sum_{i=1}^{w}(r_i-\bar{r})^2}}$$

• Inner product of aligned basic windows.

[Figure: streams x and y, each sliding window divided into aligned basic windows.]

• The inner product within a sliding window is the sum of the inner products in all the basic windows in the sliding window.
Approximate Synchronized Correlation
• Approximate each basic window with an orthogonal function family (e.g. DFT): the data points x1, …, x8 are summarized by a few coefficients c1x, c2x, c3x, and likewise y1, …, y8 by c1y, c2y, c3y.
• Orthogonality:

$$\sum_i f_m(i)\,f_p(i)=\begin{cases}V(m), & m=p\\ 0, & m\neq p\end{cases}$$

• Inner product of the time series ≈ inner product of the digests.
• The time and space complexity is reduced from O(b) to O(n):
– b: size of basic window
– n: size of the digests (n ≪ b)
• E.g., 120 time points reduce to 4 digests.
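A back-of-the-envelope illustration of the digest idea (our own sketch, not the StatStream code): keep only the first few DFT coefficients of each basic window and approximate the inner product from those alone. The doubling of the non-DC terms below is our application of Parseval's identity for real series; the data and the digest size of 4 are made up for the demo.

```python
import numpy as np

def digest(x, n=4):
    """First n complex DFT coefficients of a basic window."""
    return np.fft.fft(x)[:n]

def approx_inner(dx, dy, b):
    """Approximate <x, y> from digests alone. For real series, Parseval gives
    <x, y> = (1/b) * sum_k X_k * conj(Y_k); truncating to the first n low
    frequencies (doubling k >= 1 to stand in for their conjugate mirrors)
    gives an O(n) approximation instead of O(b) work."""
    s = dx[0] * np.conj(dy[0]) + 2.0 * np.sum(dx[1:] * np.conj(dy[1:]))
    return float(np.real(s)) / b

b = 120                                      # basic window size
t = np.linspace(0.0, 1.0, b)
x = np.sin(2 * np.pi * t) + 0.1 * np.random.randn(b)
y = np.sin(2 * np.pi * t + 0.2) + 0.1 * np.random.randn(b)
# close, because the energy of smooth series sits in the low frequencies
print(approx_inner(digest(x), digest(y), b), float(np.dot(x, y)))
```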
Approximate lagged Correlation
• Inner product with unaligned windows.

[Figure: two sliding windows shifted against each other.]

• The time complexity is reduced from O(b) to O(n²), as opposed to O(n) for synchronized correlation. Reason: terms for different frequencies are non-zero in the lagged case.
Grid Structure (to avoid checking all pairs)
• The DFT coefficients yield a vector.
• High correlation ⇒ closeness in the vector space.
– We can use a grid structure of cell side ε and look in the neighborhood; this will return a superset of the highly correlated pairs.
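A minimal sketch of the grid filter (our illustration; the function and parameter names are ours): hash each stream's low-dimensional digest to a cell of side eps, and only compare streams that land in the same or a neighboring cell. The result is a candidate superset that must still be verified exactly.

```python
import itertools
from collections import defaultdict
import numpy as np

def grid_candidates(vectors, eps):
    """vectors: dict stream_id -> low-dimensional digest (np.array).
    Returns candidate pairs whose digests lie within one grid cell of each
    other -- a superset of all pairs within distance eps."""
    grid = defaultdict(list)
    for sid, v in vectors.items():
        grid[tuple(np.floor(v / eps).astype(int))].append(sid)
    candidates = set()
    for cell, members in grid.items():
        # look in this cell and all 3^d neighboring cells
        for offset in itertools.product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for a in members:
                for b in grid.get(neighbor, []):
                    if a < b:
                        candidates.add((a, b))
    return candidates
```

Each surviving pair is then checked with the exact correlation formula; the grid only prunes, it never decides.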
Empirical Study: Speed

[Chart: wall clock time (seconds, 0–800) vs. number of streams (200–1600); the naïve method's processing time grows far faster than the DFT method's.]

Our algorithm is parallelizable.
Empirical Study: Accuracy
• Approximation errors
– A larger digest, a larger sliding window, and a smaller basic window give better approximation.
– The approximation errors (mistakes in the correlation coefficient) are small.

[Chart: average approximation error (up to about 0.005) as a function of sliding window size (0.5–3 hours) and basic window size (0.5–2 minutes).]
Sketches: Random Projection*
• Correlation between time series of stock returns
– Since most stock price time series are close to random walks, their return time series are close to white noise.
– DFT/DWT can’t capture approximately white noise series because the energy is distributed across many frequency components.
• Solution: sketches (a form of random landmark)
– Sketch pool: a list of random vectors drawn from a stable distribution.
– Sketch: the list of inner products from a data vector to the sketch pool.
– The Euclidean distance (correlation) between time series is approximated by the distance between their sketches, with a probabilistic guarantee.

* W. B. Johnson and J. Lindenstrauss. “Extensions of Lipschitz mappings into a Hilbert space”. Contemp. Math., 26:189-206, 1984.
* D. Achlioptas. “Database-friendly random projections”. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM Press, 2001.
Sketches : Intuition
• You are walking in a sparse forest and you are lost.
• You have an old-time cell phone without GPS.
• You want to know whether you are close to your friend.
• You identify yourself as 100 meters from the pointy rock, 200
meters from the giant oak etc.
• If your friend is at similar distances from several of these
landmarks, you might be close to one another.
• The sketch is just the set of distances.
Sketches: Random Projection

Raw time series: x = (x1, x2, x3, …, xw), y = (y1, y2, y3, …, yw).

Random vectors: R1 = (r11, r12, r13, …, r1w), R2 = (r21, r22, r23, …, r2w), R3 = (r31, r32, r33, …, r3w), R4 = (r41, r42, r43, …, r4w).

Sketches: (xsk1, xsk2, xsk3, xsk4) and (ysk1, ysk2, ysk3, ysk4) — the inner products of x and y with each random vector.
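A minimal sketch of the projection step in code (our illustration; the window and sketch sizes are taken from the experiment on the next slide, the rest is made up). Each window is projected onto k random vectors; the Euclidean distance between the k-dimensional sketches approximates the distance between the raw windows with high probability.

```python
import numpy as np

rng = np.random.default_rng(0)
w, k = 256, 80
R = rng.normal(size=(k, w))            # sketch pool: k random vectors

def sketch(x):
    # list of inner products with the pool; 1/sqrt(k) makes the sketch
    # distance an unbiased estimate of the raw distance
    return R @ x / np.sqrt(k)

x = rng.normal(size=w)
y = x + 0.2 * rng.normal(size=w)
print(np.linalg.norm(x - y))                   # real distance
print(np.linalg.norm(sketch(x) - sketch(y)))   # sketch distance, close w.h.p.
```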
Sketches approximate distance well

[Histogram: distribution of the factor (real distance/sketch distance) over many pairs; nearly all of the mass lies between about 0.89 and 1.25, peaking near 1.00. Sliding window size = 256, sketch size = 80.]
Empirical Study: Sketch on Price and Return Data
• DFT and DWT work well for prices (today’s price is a good predictor of tomorrow’s).
• But badly for returns: (today's price − yesterday's price)/today's price.
• Data length = 256; the first 14 DFT coefficients are used in the distance computation; the db2 wavelet is used with coefficient size = 16; sketch size is 64.
Empirical Comparison: DFT, DWT and Sketch — Price Data

[Chart: distance (0–50) over roughly 1000 data points for price data, comparing the true distance (dist) with the sketch, DWT, and DFT approximations.]
Empirical Comparison: DFT, DWT and Sketch — Return Data

[Chart: distance (0–30) over roughly 1000 data points for return data; the sketch stays close to the true distance (dist) while the DFT and DWT approximations deviate badly.]
Sketch Guarantees
• Note: sketches do not provide approximations of an individual time series window, but they help make comparisons.

Johnson-Lindenstrauss Lemma:
• For any $0 < \epsilon < 1$ and any integer n, let k be a positive integer such that

$$k \;\geq\; 4\left(\epsilon^2/2-\epsilon^3/3\right)^{-1}\ln n$$

• Then for any set V of n points in $\mathbb{R}^d$, there is a map $f:\mathbb{R}^d\rightarrow\mathbb{R}^k$ such that for all $u,v\in V$

$$(1-\epsilon)\,\|u-v\|^2 \;\leq\; \|f(u)-f(v)\|^2 \;\leq\; (1+\epsilon)\,\|u-v\|^2$$

• Further, this map can be found in randomized polynomial time.
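The lemma's bound can be computed directly; the sketch below (ours) just evaluates the formula above to show the order of magnitude of k.

```python
import math

def jl_dimension(n, eps):
    """Sketch size k preserving all pairwise distances among n points
    to within a factor of 1 +/- eps, per the JL lemma's bound."""
    return math.ceil(4 * math.log(n) / (eps ** 2 / 2 - eps ** 3 / 3))

print(jl_dimension(10_000, 0.1))  # ~7900: several thousand dimensions suffice
```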
Overcoming the curse of dimensionality*
• May need many random projections.
• Can partition sketches into disjoint pairs or triplets and perform comparisons on those.
• Each such small group is placed into an index.
• The algorithm must adapt to give the best results.

* Idea from P. Indyk, N. Koudas, and S. Muthukrishnan. “Identifying representative trends in massive time series data sets using sketches”. VLDB 2000.
[Figure: streams X, Y, Z; the inner products with random vectors r1, r2, r3, r4, r5, r6 give each stream a six-entry sketch, e.g. (xsk1, …, xsk6), which is then split into disjoint pairs — (xsk1, xsk2), (xsk3, xsk4), (xsk5, xsk6) — each indexed separately.]
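A minimal illustration (ours, with made-up names and parameters) of the partitioning idea in the figure: each disjoint pair of sketch entries goes into its own low-dimensional grid index, and series that agree on enough pairs become candidates for a full check.

```python
from collections import defaultdict
import math

def pair_keys(sk, cell=1.0):
    """Quantize each disjoint (sk[2i], sk[2i+1]) pair to a 2-D grid cell."""
    return [(i, math.floor(sk[2 * i] / cell), math.floor(sk[2 * i + 1] / cell))
            for i in range(len(sk) // 2)]

def candidates(sketches, min_agree=2, cell=1.0):
    """sketches: dict stream_id -> sketch (list of floats).
    Returns pairs that share a grid cell in at least min_agree pair-indexes."""
    index = defaultdict(set)
    for sid, sk in sketches.items():
        for key in pair_keys(sk, cell):
            index[key].add(sid)
    votes = defaultdict(int)
    for members in index.values():
        for a in members:
            for b in members:
                if a < b:
                    votes[(a, b)] += 1
    return {pair for pair, v in votes.items() if v >= min_agree}
```

Tuning min_agree and cell is the "algorithm must adapt" point above: stricter settings prune more but risk missing true matches before verification.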
Further Performance Improvements
• Suppose we have R random projections of window size WS.
• It might seem that we have to do R·WS work for each time point for each time series.
• In ongoing work with colleague Richard Cole, we show that we can cut this down by use of convolution and an oxymoronic notion of “structured random vectors”*.

* Idea from Dimitris Achlioptas, “Database-friendly Random Projections”, Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
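The structured-random-vector work is described as ongoing, so there is no code to transcribe; the sketch below shows only the database-friendly distribution from the cited Achlioptas paper that it builds on: entries drawn from {+1, 0, −1} with probabilities 1/6, 2/3, 1/6, scaled by √3. Sparse integer entries make the R·WS inner products much cheaper than Gaussian ones.

```python
import numpy as np

rng = np.random.default_rng(1)

def achlioptas_matrix(k, w):
    """k x w projection matrix with entries sqrt(3)*{+1, 0, -1},
    chosen with probabilities 1/6, 2/3, 1/6 (Achlioptas 2001)."""
    entries = rng.choice([1.0, 0.0, -1.0], size=(k, w), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0) * entries

R = achlioptas_matrix(64, 256)   # about 2/3 of the entries are zero
```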
Empirical Study: Speed

[Chart: wall clock time (seconds, 0–600) vs. number of streams (200–2200) for Sketch_Return, Sketch_Price, DFT_Return, DFT_Price, and Naive; the sketch- and DFT-based methods scale far better than the naïve method.]

• Sketch/DFT + grid structure
• Sliding window size = 3616, basic window size = 32
• Correlation > 0.9
Empirical Study: Breakdown

[Chart: processing time of return data (wall clock seconds, 0–40) vs. number of streams (500–5500), broken down into updating sketches vs. detecting correlation.]
Empirical Study: Breakdown

[Chart: processing time of price data (wall clock seconds, 0–450) vs. number of streams (500–5500), broken down into updating sketches vs. detecting correlation.]
Query by humming:
Correlation + Shifting
Query By Humming
• You have a song in your head.
• You want to get it but don’t know its title.
• If you’re not too shy, you hum it to your friends or to a
salesperson and you find it.
• They may grimace, but you get your CD.
With a Little Help From My Warped Correlation
• Karen’s humming — Match: [audio clip]
• Dennis’s humming — Match: [audio clip]
• “What would you do if I sang out of tune?”
• Yunyue’s humming — Match: [audio clip]
Related Work in Query by Humming
• Traditional method: string matching [Ghias et al. 95, McNab et al. 97, Uitdenbogerd and Zobel 99]
– Music represented by a string of pitch directions: U, D, S (degenerated interval)
– The hum query is segmented into discrete notes, then into a string of pitch directions
– Edit distance between the hum query and the music score
• Problem
– Very hard to segment the hum query
– Partial solution: users are asked to hum articulately
• New method: matching directly from audio [Mazzoni and Dannenberg 00]
• We use both.
Time Series Representation of Query
• An example hum query — segment this!

[Chart: pitch values (10–70) over time (0–11 seconds) for a hummed query.]

• Note segmentation is hard!
How to deal with poor hummers?
• No absolute pitch
– Solution: the average pitch is subtracted
• Inaccurate pitch intervals
– Solution: return the k-nearest neighbors
• Incorrect overall tempo
– Solution: Uniform Time Warping
• Local timing variations
– Solution: Dynamic Time Warping
• Bottom line: timing variations take us beyond Euclidean
distance.
Dynamic Time Warping
• Euclidean distance: sum of point-by-point distances.
• DTW distance: allows stretching or squeezing the time axis locally.

[Figure: two time series aligned under a local warping of the time axis.]
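A minimal O(len(a)·len(b)) dynamic-programming DTW, to make the "stretch or squeeze the time axis locally" idea concrete (our illustration, not the system's code):

```python
def dtw(a, b):
    """Squared-error DTW distance between two sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # each point may advance both series (diagonal), repeat a point
            # of b (stretch), or repeat a point of a (squeeze)
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

print(dtw([1, 2, 3, 3, 4], [1, 2, 3, 4]))  # 0.0: same shape, different timing
```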
Envelope Transform using Piecewise Aggregate Approximation (PAA) [Keogh VLDB 02]

[Figure: an original time series with its upper and lower envelopes and their PAA versions U_Keogh and L_Keogh; a second panel shows the tighter envelopes U_new and L_new.]

• Advantage of tighter envelopes
– Still no false negatives, and fewer false positives.
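A minimal version (ours) of the Keogh-style envelope idea behind these filters, before any PAA tightening: within a warping window r, the upper/lower envelopes bound every warped alignment of the query, so the distance from a candidate to the envelope lower-bounds the DTW distance and can prune candidates with no false negatives.

```python
def envelope(q, r):
    """Upper/lower envelopes of query q under warping window r."""
    n = len(q)
    upper = [max(q[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    lower = [min(q[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    return upper, lower

def lb_keogh(c, upper, lower):
    """Lower bound on DTW(q, c): only points of c outside the envelope
    contribute. Safe to discard c if this exceeds the best-so-far."""
    s = 0.0
    for ci, u, l in zip(c, upper, lower):
        if ci > u:
            s += (ci - u) ** 2
        elif ci < l:
            s += (ci - l) ** 2
    return s
```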
Container-Invariant Envelope Transform
• Container-invariant: a transformation T for envelopes such that containment is preserved in the feature space — any series lying inside an envelope still lies inside the transformed envelope after T is applied.
• Theorem: if a transformation is container-invariant and lower-bounding, then the distance between a transformed time series x and the transformed envelope of y lower-bounds their DTW distance.
Framework to Scale Up for a Large Database

[Flow diagram: a hum query (humming with ‘ta’) is segmented into notes/durations; the query criteria are rhythm (duration) and melody (note). A keyword filter and boosted statistics-based feature filters prune the database; a nearest-N search on melody with the transformed-envelope filter returns the top N matches by DTW distance; an alignment verifier on DTW then yields the final top N′ matches.]
Improvement by Introducing Humming with ‘ta’*
• Solves the problem of note segmentation.
• Compare humming with ‘la’ and ‘ta’.

* Idea from N. Kosugi et al., “A practical query-by-humming system for a large music database”, ACM Multimedia 2000.

Improvement by Introducing Humming with ‘ta’ (2)
• Still use DTW distance to tolerate poor humming.
• Decrease the size of the time series by orders of magnitude.
• Thus reduce the computation of the DTW distance.
Statistics-Based Filters*
• Low-dimensional statistical feature comparison
– Low computation cost compared to DTW distance
– Quickly filters out true negatives
• Example
– Filter out candidates whose note length is much larger/smaller than the query’s note length
• More features
– Standard deviation of note value
– Zero crossing rate of note value
– Number of local minima of note value
– Number of local maxima of note value

* Intuition from Erling Wold et al., “Content-based classification, search and retrieval of audio”, IEEE Multimedia 1996. http://www.musclefish.com
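A minimal sketch (ours; the feature set follows the list above, the tolerances are invented) of the cheap features such filters compare before any DTW is attempted:

```python
import numpy as np

def stat_features(notes):
    """Cheap statistics of a note-value sequence, per the list above."""
    notes = np.asarray(notes, dtype=float)
    c = notes - notes.mean()          # remove absolute pitch
    return {
        "length": len(notes),
        "std": float(notes.std()),
        "zero_cross": int(np.sum(c[:-1] * c[1:] < 0)),
        "local_min": int(np.sum((notes[1:-1] < notes[:-2]) & (notes[1:-1] < notes[2:]))),
        "local_max": int(np.sum((notes[1:-1] > notes[:-2]) & (notes[1:-1] > notes[2:]))),
    }

def passes_filter(qf, cf, len_tol=0.3, std_tol=0.5):
    """Discard candidates whose features differ too much from the query's."""
    return (abs(qf["length"] - cf["length"]) <= len_tol * qf["length"]
            and abs(qf["std"] - cf["std"]) <= std_tol * max(qf["std"], 1e-9))
```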
Boosting Statistics-Based Filters
• Characteristics of statistics-based filters
– Quick but weak classifiers
– May have false negatives/false positives
– Ideal candidates for boosting
• Boosting*
– “An algorithm for constructing a ‘strong’ classifier using only a training set and a set of ‘weak’ classification algorithms”
– “A particular linear combination of these weak classifiers is used as the final classifier which has a much smaller probability of misclassification”

* Cynthia Rudin et al., “On the Dynamics of Boosting”, in Advances in Neural Information Processing Systems 2004.
Verify Rhythm Alignment in the Query Result
• The nearest-N search uses only melody information.
• Will A. Arentz et al.* suggest combining rhythm and melody:
– Results are generally better than using only melody information
– Not appropriate when the sum of several notes’ durations in the query may correspond to the duration of one note in the candidate
• Our method:
– First use melody information for the DTW distance computation
– Merge durations appropriately based on the note alignment
– Reject candidates that have bad rhythm alignment

* Will Archer Arentz, “Methods for retrieving musical information based on rhythm and pitch correlation”, CSGSC 2003.
Experiments: Setup
• Data set
– 1049 songs: Beatles, American rock and pop, one Chinese song
– 73,051 song segments
• Query algorithms compared
– TS: matching pitch-contour time series using DTW distance and envelope filters
– NS: matching ‘ta’-based note sequences using DTW distance and envelope filters plus length- and standard-deviation-based filters
– NS2: NS plus boosted statistics-based filters and alignment verifiers
• Training set: human humming queries
– 17 ‘ta’-style humming clips from 13 songs (half are Beatles songs)
– The number of notes varies from 8 to 25, average 15
– The matching song segments are labeled by musicians
• Test set: simulated queries
– Queries are generated based on 1000 randomly selected segments in the database
– The following error types are simulated: different keys/tempos, inaccurate pitch intervals, varying tempos, missing/extra notes, and background noise*

* The error model is experimental and the error parameters are based on analysis of the training set.
Experiments: Result

[Charts: hit rate (%) at Top15/Top10/Top5/Top1 for TS, NS, and NS2 — NS2 reaches roughly 97.3, 97.0, 96.0, and 82.6 respectively; average time cost: TS 10.42 s, NS 2.74 s, NS2 0.41 s; maximum time cost: TS 150 s, NS 21 s, NS2 2 s.]

• NS performs better than TS when the scale is large.
• NS2 is roughly 6–10 times faster than NS, with a small loss of hit rate of about 1%.
• NS2 can achieve:
– A high hit rate: 97.0% (Top10)
– Quick responses: 0.42 seconds on average, 2 seconds in a worst-case scenario
Query by Humming Demo
• 1039 songs (73,051 note/duration sequences)
Burst detection:
when window size is unknown
Burst Detection: Applications
• Discovering intervals with unusually large numbers of events.
– In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. The shower might last milliseconds or days…
– In telecommunications, if the number of packets lost within a certain time period exceeds some threshold, it might indicate a network anomaly. The exact duration is unknown.
– In finance, stocks with unusually high trading volumes should attract the notice of traders (or perhaps regulators).
Bursts across different window sizes in Gamma Rays
• Challenge: discover not only the time of the burst, but also its duration.
Burst Detection: Challenge
• Single stream problem.
• What makes it hard is that we are looking at multiple window sizes at the same time.
• The naïve approach is to do this one window size at a time.
Elastic Burst Detection: Problem Statement
• Problem: given a time series of positive numbers x1, x2, ..., xn and a threshold function f(w), w = 1, 2, ..., n, find the subsequences of any size whose sums are above the thresholds:
– all 0 < w < n, 0 < m < n−w, such that xm + xm+1 + … + xm+w−1 ≥ f(w)
• Brute-force search: O(n²) time.
• Our shifted binary tree (SBT): O(n + k) time.
– k is the size of the output, i.e., the number of windows with bursts.
Burst Detection: Data Structure and Algorithm
• Shifted binary tree: level k holds windows of size 2^k that overlap by half, so every window of size at most 2^(k−1)+1 fits inside some level-k node.
– Define the threshold for a node of size 2^k to be the threshold for a window of size 2^(k−1) + 1.
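A minimal, simplified sketch (ours, not the released code) of SBT-style detection. It relies on the two properties stated above: the inputs are positive (so a node's sum upper-bounds every window inside it) and f is non-decreasing (so the smallest window size a level covers has the smallest threshold). A node whose sum stays below that threshold rules out bursts at all the sizes it covers; node alarms are only potential bursts and are confirmed by a direct check.

```python
def sbt_bursts(x, f, max_w):
    """x: positive numbers; f: non-decreasing threshold function.
    Returns all (start, w) with w <= max_w and sum(x[start:start+w]) >= f(w)."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    wsum = lambda s, e: prefix[min(e, n)] - prefix[s]   # sum of x[s:e]
    bursts = set()
    k = 1
    while True:
        size, half = 2 ** k, 2 ** (k - 1)
        lo = 1 if k == 1 else 2 ** (k - 2) + 2   # window sizes owned by level k
        hi = min(half + 1, max_w)
        if lo > max_w:
            break
        for start in range(0, n, half):               # nodes shifted by half
            if wsum(start, start + size) >= f(lo):    # cheap node-level alarm
                for w in range(lo, hi + 1):           # detailed search inside
                    for s in range(start, min(start + size, n - w + 1)):
                        if wsum(s, s + w) >= f(w):
                            bursts.add((s, w))
        k += 1
    return bursts

x = [1.0] * 50
x[20:23] = [30.0, 40.0, 30.0]                    # an injected burst
print(sorted(sbt_bursts(x, lambda w: 25 + 10 * w, max_w=8)))
```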
Burst Detection: Example

[Figure: a shifted binary tree built over a sample series, with window-size thresholds f(2) = 24, f(3) = 26, f(4) = 47. Node sums that exceed their level’s threshold raise alarms; the detailed check then separates true alarms (some window inside the node really exceeds its threshold) from false alarms (the node sum is high but no window inside it qualifies).]
Burst Detection: Algorithm
• In linear time, determine whether any node in the SBT indicates an alarm.
• If so, do a detailed search to confirm (true alarm) or deny (false alarm) a real burst.
• In the online version of the algorithm, we need keep only the most recent node at each level.
False Alarms (require work, but cause no errors)

[Chart: false alarm rate (0 to 0.06) as a function of T = W/w from 1 to 2, for p = 0.000001.]
Empirical Study: Gamma Ray Burst

[Chart: processing time (ms, 0–80,000) vs. number of windows (0–50); the SBT algorithm's processing time stays far below the direct algorithm's as the number of windows grows.]
Case Study: Burst Detection (1)
Background and motivation: in astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusual event burst may signal an event interesting to physicists.

Technical overview:
1. The sky is partitioned into 1800 × 900 buckets.
2. 14 sliding window lengths are monitored, from 0.1 s to 39.81 s.
3. The original code implements the naïve Θ(n²) algorithm.
Case Study: Burst Detection (2)
The challenges:
1. Vast amount of data
– 1800 × 900 time series, so any trivial overhead may accumulate into a nontrivial expense.
2. Unavoidable overheads of data transformations
– Data pre-processing such as fetching and storage requires much work.
– SBT trees have to be built no matter how many sliding windows are to be investigated.
– Thresholds are maintained over time due to the different background noises.
– A hit on one bucket affects its neighbours, as shown in the previous figure.
Case Study: Burst Detection (3)
Our solutions:
1. Combine nearby buckets into one to save space and processing time. If any alarm is reported for the large bucket, go down to examine each small component (two-level detailed search).
2. Special implementation of the SBT tree
– Build the SBT tree including only those levels covering the sliding windows.
– Maintain a threshold tree for the sliding windows and update it over time.
Fringe benefits:
1. Adding window sizes is easy.
2. More sliding windows monitored also benefits physicists.
Case Study: Burst Detection (4)
Experimental results:
1. Benefits improve with more sliding windows.
2. Results are consistent across different data files.
3. The SBT algorithm runs 7 times faster than the current algorithm.
4. More improvement is possible if memory limitations are removed.

[Chart: running time vs. number of sliding windows (14, 28, 42) for SBT1 and SBT2 vs. the old algorithms old1 and old2; the SBT versions stay far below.]
Extension to other aggregates
• SBT can be used for any aggregate that is monotonic:
– SUM, COUNT and MAX are monotonically increasing
• alarm when aggregate ≥ threshold
– MIN is monotonically decreasing
• alarm when aggregate ≤ threshold
– Spread = MAX − MIN
• Applications in finance
– Stocks with bursts of trading or quote (bid/ask) volume (Hammer!)
– Stock prices with high spread
Empirical Study: Stock Price Spread Burst

[Chart, log scale: processing time (ms, 1 to 1,000,000) vs. number of windows (0–50); the SBT algorithm runs orders of magnitude faster than the direct algorithm.]
Extension to high dimensions

Elastic Burst in two dimensions
• Example: population distribution in the US.
Can discover numeric thresholds from a probability threshold
• Suppose that the moving sum of a time series is a random variable from a normal distribution.
• Let the number of bursts in the time series within sliding window size w be So(w) and its expectation be Se(w).
– Se(w) can be computed from the historical data.
• Given a threshold probability p, we set the threshold of burst f(w) for window size w such that Pr[So(w) ≥ f(w)] ≤ p.
Find threshold for Elastic Bursts
• Φ(x) is the normal cumulative distribution function, so there is little probability mass in the tails:

$$\Pr[X \leq \Phi^{-1}(p)] \leq p \qquad\qquad \Pr[X \geq -\Phi^{-1}(p)] \leq p$$

[Figure: the two tail regions of Φ(x), marked at Φ⁻¹(p); the curves touch at a negative x value and the picture is symmetric above.]

• Standardizing the burst count and solving for the threshold:

$$\Pr\!\left[\frac{S_o(w)-S_e(w)}{S_e(w)} \;\geq\; -\Phi^{-1}(p)\right] \leq p
\;\;\Longrightarrow\;\;
f(w) = S_e(w)\left(1-\Phi^{-1}(p)\right)$$
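Evaluating that threshold is one line with scipy's inverse normal CDF; the sketch below (our transcription of the formula above, with a made-up Se(w)) shows the scale of the resulting threshold.

```python
from scipy.stats import norm

def burst_threshold(se_w, p):
    """f(w) = Se(w) * (1 - Phi^{-1}(p)); note Phi^{-1}(p) < 0 for p < 0.5,
    so the threshold sits well above the expected count."""
    return se_w * (1.0 - norm.ppf(p))

print(burst_threshold(100.0, 1e-6))  # ~575: alarm only on ~1-in-a-million sums
```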
Summary of Burst Detection
• Able to detect bursts at many different window sizes in essentially linear time.
• Can be used both for time series and for spatial searching.
• Can specify thresholds either with absolute numbers or with a probability of a hit.
• The algorithm is simple to implement and has low constants (code is available).
• OK, it’s embarrassingly simple.
AQuery
A Database System for Order
Time Series and DBMSs
• The usual approach is to store a time series as a User-Defined Datatype (UDT) and provide methods for manipulating it.
• Advantages
– Series can be stored and manipulated by the DBMS
• Disadvantages
– UDTs are somewhat opaque to the optimizer (it misses opportunities)
– Operations that mix series and “regular” data may be awkward to write
The Case for Order¹
• Series are sequences! Support them.
• Add order to multiset-based DBMSs²,³,⁴
• AQuery is a query language (a conservative SQL extension) and a data model that supports order from the ground up.

1 “A Call to Order”, David Maier and Bennet Vance, PODS’93
2 “SRQL: Sorted Relational Query Language”, Ramakrishnan et al., SSDBM’98
3 “SEQ: A Model for Sequence Databases”, Seshadri et al., ICDE’95
4 OLAP Amendment to SQL:1999
An Arrable-based Data Model
• Arrable = a collection of named arrays with the same cardinality.
• Arrays = vectors of basic types, or vectors thereof (no further nesting!).
• An arrable’s order may be declared.
• Two ways to accommodate series: “horizontally” and “vertically”. Both arrables below are ordered by ts.

Ticks (ts time series order):

  ID    price  ts
  IBM   12.02   1
  MSFT  43.23   2
  IBM   12.04   5
  IBM   12.05   9
  MSFT  43.22  13

Series (ID, ts time series order):

  ID    price                 ts
  IBM   [12.02 12.04 12.05]   [1 5 9]
  MSFT  [43.23 43.22]         [2 13]
AQuery’s New Language Elements
• ASSUMING ORDER clause
– Order is explicitly declared on a per-query basis
– All other clauses can count on order (and preserve it)
• Column-oriented semantics
– AQuery: R.col in a query binds to an array/vector (an entire column)
– SQL: R.col binds successively to the scalar field values in a column
• Built-in functions for order-dependent aggregates over columns
• Can group by without aggregation
Moving Average over Arrables

Sales, before and after ordering by month:

  month  sold          month  sold
  4      140           1      100
  2      120           2      120
  3      140     →     3      140
  1      100           4      140
  5      130           5      130

SELECT month, sold, avgs(3, sold)
FROM Sales
ASSUMING ORDER month

• Each query defines the data ordering it wants to work with.
• Order is enforced at the beginning of the query, right after the FROM clause.
• All expressions and clauses are order preserving.
Moving Average over Arrables (2)

  month  sold
  1      100
  2      120
  3      140
  4      140
  5      130

SELECT month, sold, avgs(3, sold)
FROM Sales
ASSUMING ORDER month

• Variables are bound to an entire column at once, as opposed to a succession of its values.
• Expressions are mappings from a list of columns (arrays) into a column.
Moving Average over Arrables (3)

  month  sold  3-avg
  1      100   100
  2      120   110
  3      140   120
  4      140   133
  5      130   136

SELECT month, sold, avgs(3, sold)
FROM Sales
ASSUMING ORDER month

• Several built-in “vector-to-vector” functions.
• avgs, for instance, takes a window size and a column and returns the resulting moving-average column.
• Other v2v functions: prev, next, first, last, sums, mins, ..., and UDFs.
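A small Python analogue (ours, not AQuery's implementation) of the avgs(3, sold) column above: a moving average whose window simply grows until 3 points are available, which reproduces the 100, 110, 120, 133, 136 column in the table.

```python
def avgs(k, col):
    """Moving average over a window of up to k trailing values."""
    out = []
    for i in range(len(col)):
        window = col[max(0, i - k + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

print(avgs(3, [100, 120, 140, 140, 130]))
# [100.0, 110.0, 120.0, 133.33..., 136.66...]
```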
Best-Profit Query

Find the best profit one could make by buying a stock and selling it later in the same day, using Ticks(ID, price, tradeDate, ts).

[Chart: quoted price over time points 1–16.]

  time   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
  price 15 19 16 17 15 13  5  8  7 13 11 14 10  5  2  5
Formulating the Best-Profit Query

Buy at 5, sell at 14:

  time            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
  price          15 19 16 17 15 13  5  8  7 13 11 14 10  5  2  5
  mins(price)    15 15 15 15 15 13  5  5  5  5  5  5  5  5  2  2
  price − mins    0  4  1  2  0  0  0  3  2  8  6  9  5  0  0  3

The answer is the maximum of the last row: 9.
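The same computation in Python (our sketch): mins corresponds to a running minimum, and the best profit is the maximum of price minus that running minimum.

```python
from itertools import accumulate

def best_profit(prices):
    """max over time of price - running minimum so far,
    i.e. max(price - mins(price))."""
    running_min = accumulate(prices, min)
    return max(p - m for p, m in zip(prices, running_min))

price = [15, 19, 16, 17, 15, 13, 5, 8, 7, 13, 11, 14, 10, 5, 2, 5]
print(best_profit(price))  # 9: buy at 5, sell at 14
```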
Best-Profit Query Comparison

[AQuery]
SELECT max(price - mins(price))
FROM ticks
ASSUMING timestamp
WHERE ID = "S"
AND tradeDate = '1/10/03'

[SQL:1999]
SELECT max(rdif)
FROM (SELECT ID, tradeDate,
             price - min(price)
               OVER (PARTITION BY ID, tradeDate
                     ORDER BY timestamp
                     ROWS UNBOUNDED PRECEDING) AS rdif
      FROM Ticks) AS t1
WHERE ID = "S"
AND tradeDate = '1/10/03'

The AQuery optimizer can push down the selection simply by testing whether the selection is order dependent or not; a SQL:1999 optimizer has to figure out whether the selection would alter the semantics of the windows (partitions), a much harder question.
Best-Profit Query Performance

[Chart: time (seconds, 0–18) vs. size in rows ×1000 (200–1000); AQuery stays well below the SQL:1999 optimizer as the size grows.]
MicroArrays in a Nutshell
• A device to analyze thousands of gene expressions at once.
• The base mechanism for acquiring data in a number of genomic experiments.
• Generates large matrices and/or series that then have to be analyzed, quite often with statistical methods.

[Figure: each spot on the chip corresponds to a cDNA; the entire chip is exposed to stimuli (light, nutrients, etc.) and each spot reacts by producing more mRNA (expressing) or less (repressing) than the base quantity.]
Generating Series in a Query

RawExpr, and the grouped result:

  exprId  geneId  value        geneId  value
  1       g1       1.89        g1      [1.89  2.32  0.32 ...]
  1       g2      -0.5         g2      [-0.5  1.22  1.03 ...]
  1       g3       1.3         g3      [1.3  -0.38 -1.43 ...]
  1       g4       0.8         g4      [0.8   0.79  3.23 ...]
  ...     ...      ...         ...     ...

SELECT geneID, value
FROM RawExpr
ASSUMING ORDER geneID, exprID
GROUP BY geneID

• AQuery allows grouping without aggregation.
• Non-grouped columns become “vector fields”.
• Queries can manipulate “vector fields” in the same way they do other attributes.
Taking Advantage of Ordering

[Query plan: scan RawExpr → (1) sort on geneID, exprID → (2) group by geneID → project geneID, value.]

SELECT geneID, value
FROM RawExpr
ASSUMING ORDER geneID, exprID
GROUP BY geneID

• Group-by and sort share a prefix.
• The sort may be eliminated if RawExpr is already in a convenient order¹,².
• Sort and group-by are commutative: either sort everything and then “slice” the groups, or group first and perform one smaller sort per group.

1 “Bringing Order to Query Optimization”, Slivinskas et al., SIGMOD Record 31(2)
2 “Fundamental Techniques for Order Optimization”, Simmen et al., SIGMOD’96
Sort Cost Can Be Reduced

[Chart: time (seconds) to order 1 million records divided into x groups (x from 1 to 10,000), comparing one global sort against per-group sorts; sorting within groups becomes cheaper as the number of groups grows.]
Manipulating Series

SeriesExpr:
  geneId  value
  g1      [1.89  2.32  0.32 ...]
  g2      [-0.5  1.22  1.03 ...]
  g3      [1.3  -0.38 -1.43 ...]
  g4      [0.8   0.79  3.23 ...]
  ...     ...

SELECT T1.geneID, T2.geneID
FROM SeriesExpr T1, SeriesExpr T2
WHERE corr(T1.value, T2.value) > 0.9

• corr() implements Pearson’s correlation.
• It takes two columns of “vector fields” as arguments.
• It returns a vector of correlation factors.
• AQuery can easily be extended with other UDFs.
AQuery Summary
• AQuery is a natural evolution of the Relational Model.
• AQuery is a concise language for querying order.
• The optimization possibilities are vast.
• Applications to finance, physics, biology, network management, ...
Overall Summary
• Improved technology implies better sensors and more data.
• Real-time response is very useful and profitable in applications ranging from astrophysics to finance.
• These require better algorithms and better data management.
• This is a great field for research and teaching.