Fast Calculations of Simple
Primitives in Time Series
Dennis Shasha
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
Joint work with Richard Cole, Xiaojian Zhao
(correlation), Zhihua Wang (humming), Yunyue
Zhu (both), and Tyler Neylon (svds, trajectories)
1
Roadmap
Section 1: Motivation
Section 2: StatStream: A Fast Sliding-Window-Based Correlation Detector
- Problem Statement
- Cooperative and Uncooperative Time Series
- Algorithmic Framework
- DFT-Based Scheme and Random Projection
- Combinatorial Design and Bootstrapping
- Empirical Study
Section 3: Elastic Burst Detection
- Problem Statement
- Challenge
- Shifted Binary Tree
- Astrophysical Application
2
Overall Motivation
- Financial time series streams are watched closely by millions of traders. What exactly do they look for, and how can we help them do it faster?
  Typical query: "Which pairs of stocks had highly correlated returns over the last three hours?"
- Physicists study the time series emerging from their sensors.
  Typical query: "Do there exist bursts of gamma rays in windows of any size from 8 milliseconds to 4 hours?"
- Musicians produce time series.
  Typical query: "Even though I can't hum well, please find this song. I want the CD."
3
Why Speed Is Important
- As processors speed up, algorithmic efficiency no longer matters ... one might think.
- That is true if problem sizes stay the same, but they don't.
- As processors speed up, sensors improve: satellites spew out a terabyte a day, magnetic resonance imagers produce higher-resolution images, etc.
- We desire real-time response to queries.
4
Surprise, surprise
More data, real-time response, increasing
importance of correlation
IMPLIES
Efficient algorithms and data management
more important than ever!
5
Section 2:
StatStream: A Fast Sliding-Window-Based
Correlation Detector
6
Scenario
- Stock price streams
  - The New York Stock Exchange (NYSE)
  - 50,000 securities (streams); 100,000 ticks (trade and quote)
- Pairs Trading, a.k.a. Correlation Trading
  - Query: "Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?"
  - A trader's reasoning: "XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC ..."
7
Motivation: Online Detection of High Correlation
[Figure: pairs of streams shown over time, flagged "Correlated!" when their windows move together.]
8
Problem Statement
- Synchronous time series window correlation
  - Given Ns streams, a start time t_start, and a window size w, find, for each time window W of size w, all pairs of streams S1 and S2 such that S1 during time window W is highly correlated with S2 during the same time window. (The possible time windows are [t_start, t_start + w - 1], [t_start + 1, t_start + w], ..., where t_start is some start time.)
- Asynchronous correlation
  - Allow shifts in time. That is, given Ns streams and a window size w, find all time windows W1 and W2 with |W1| = |W2| = w and all pairs of streams S1 and S2 such that S1 during W1 is highly correlated with S2 during W2.
9
Cooperative and Uncooperative Time Series
- Cooperative time series
  - Exhibit a fundamental degree of regularity, at least over the short term.
  - Allow long time series to be compressed to a few coefficients with little loss of information, using data reduction techniques such as Fourier transforms and wavelet transforms.
  - Example: stock price time series
- Uncooperative time series
  - Regularities are absent; the series resembles noise.
  - Example: stock return time series (difference in price / avg price)
10
Algorithmic Framework
Basic definitions:
- Timepoint: the smallest unit of time over which the system collects data, e.g., a second.
- Basic window: a consecutive subsequence of timepoints over which the system maintains a digest (i.e., a compressed representation) and returns results, e.g., two minutes.
- Sliding window: a user-defined consecutive subsequence of basic windows over which the user wants statistics, e.g., an hour. The user might ask, "Which pairs of streams were correlated with a value of over 0.9 for the last hour?" and then ask again 2 minutes later.
11
Definitions: Sliding Window and Basic Window
[Figure: n stock streams along a time axis; timepoints are grouped into basic windows of size 2, and basic windows are grouped into a sliding window of size 8.]
12
Algorithmic Strategy (cooperative case)
[Diagram: time series 1..n pass through dimensionality reduction (DFT, DWT, SVD) to produce digests 1..n; the digests are placed in a grid structure, which outputs the correlated pairs.]
13
GEMINI framework
(Faloutsos et al.)
The transformation ideally has the lower-bounding property: the distance in the reduced space never exceeds the true distance, so filtering produces no false negatives.
14
DFT-Based Scheme*
[Diagram: the sliding window is divided into basic windows along the time axis; for each basic window the system keeps digests: the sum and the first DFT coefficients.]
*D. Shasha and Y. Zhu. High Performance Discovery in Time Series: Techniques and Case Studies.
Springer, 2004.
15
Incremental Processing
- Compute the DFT one basic window at a time.
- Then add (with angular shifts) to get a DFT for the whole sliding window. The time is just the DFT time for a basic window plus time proportional to the number of DFT components we need. (A code sketch follows below.)
- Using the first few DFT coefficients for the whole sliding window, represent the sliding window by a point in a grid structure.
- We end up having to compare very few time windows, so a potentially quadratic comparison problem becomes linear in practice.
16
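Since this update rule is the heart of the scheme's speed, here is a minimal Python sketch of the angular-shift idea (our illustration: it slides by a single timepoint rather than by a whole basic window, and the function and variable names are ours):

```python
import cmath

def sliding_dft_update(coeffs, x_old, x_new, n):
    """Update the first few DFT coefficients of a length-n window when it
    slides forward by one timepoint: drop x_old, append x_new, and rotate
    each coefficient's phase (the "angular shift")."""
    return [(X_k - x_old + x_new) * cmath.exp(2j * cmath.pi * k / n)
            for k, X_k in enumerate(coeffs)]

# Maintain the first 4 coefficients of a window of size 8.
n, num_coeffs = 8, 4
window = [1.0, 2.0, 4.0, 3.0, 0.0, -1.0, 2.0, 5.0]
coeffs = [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
              for t, x in enumerate(window))
          for k in range(num_coeffs)]
coeffs = sliding_dft_update(coeffs, window[0], 6.0, n)
```

Each slide costs time proportional to the number of coefficients kept, not to the window length.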
Grid Structure
Each sliding window is summarized by a digest point x = (x1, x2, ..., xk).
[Figure: digest points placed in a k-dimensional grid; only points in the same or adjacent cells need be compared.]
17
Problem: Doesn't Always Work
- DFT approximates the price-like data type very well.
- However, it is poor for stock returns: (today's price - yesterday's price) / yesterday's price.
- A return series is more like white noise, which contains all frequency components.
- DFT uses the first n (e.g., 10) coefficients to approximate the data, which is insufficient in the case of white noise.
18
DFT on Random Walk (works well) and White Noise (works badly)
[Figure: two panels plotting the ratio of captured energy to total energy (y-axis) against the number of DFT coefficients (x-axis, 1 to 101). For the random walk, the first few coefficients capture most of the energy; for white noise, the ratio grows only gradually with the number of coefficients.]
19
Random Projection: Intuition
- You are walking in a sparse forest and you are lost.
- You have an outdated cell phone without a GPS.
- You want to know whether you are close to your friend.
- You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak, etc.
- If your friend is at similar distances from several of these landmarks, you might be close to one another.
- Random projections are analogous to these distances to landmarks.
20
How to Compute a Random Projection*
- Random vector pool: a list of random vectors drawn from a stable distribution (these are the landmarks).
- Project the time series into the space spanned by these random vectors. (A code sketch follows below.)
- The Euclidean distance (correlation) between two time series is approximated by the distance between their sketches, with a probabilistic guarantee.
- Note: sketches do not approximate individual time series windows; they only support comparisons.
* W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26:189-206, 1984.
21
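To make the landmark analogy concrete, here is a minimal sketch computation in Python (our illustration; the names are ours, and N(0,1) stands in for the stable distribution of the pool):

```python
import math, random

def make_pool(num_vectors, w, seed=0):
    """Random vector pool: num_vectors directions of length w with
    N(0,1) entries (the landmarks)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(num_vectors)]

def sketch(window, pool):
    """Project a length-w window onto each random vector (inner products)."""
    return [sum(x * r for x, r in zip(window, vec)) for vec in pool]

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

w, k = 64, 30
pool = make_pool(k, w)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(w)]
y = [rng.gauss(0, 1) for _ in range(w)]
# After scaling by 1/sqrt(k), the sketch distance estimates the true distance.
print(dist(x, y), dist(sketch(x, pool), sketch(y, pool)) / math.sqrt(k))
```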
Random Projection
X’ current
position
inner product
x  ( x1 , x2 , x3 ,...xw )
y  ( y1 , y2 , y3 ,...yw )
X’
relative
distances
Rocks,
buildings
…
R1  (r11, r1 2, r1 3,...,r1w)
R2  (r21, r2 2, r2 3,...,r2 w)
R3  (r31, r3 2, r3 3,...,r3w)
( xsk1, xsk 2, xsk3, xsk 4)
( ysk1, ysk 2, ysk3, ysk 4)
R4  (r41, r4 2, r4 3,...,r4 w)
Y’ current
position
raw time series
random vector
sketches
Y’
relative
distances
Sketch Guarantees
Johnson-Lindenstrauss Lemma:
For any 0 < ε < 1 and any integer n, let k be a positive integer such that
    k ≥ 4 (ε²/2 − ε³/3)⁻¹ ln n.
Then for any set V of n points in R^d, there is a map f: R^d → R^k such that, for all u, v in V,
    (1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖².
Further, this map can be found in randomized polynomial time.
23
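As a worked example of the bound (our arithmetic, not the slides'): for n = 10,000 points and ε = 0.1, the lemma requires k ≈ 4 · ln(10,000) / (0.1²/2 − 0.1³/3) ≈ 7,895 dimensions, far more than the sketch sizes of 30-60 used here; that is why the system layers grouping, grids, and exact verification on top of sketches instead of relying on the lemma's worst-case constant.

```python
import math

def jl_k(n, eps):
    """Dimension bound k >= 4 (eps^2/2 - eps^3/3)^-1 ln n from the lemma."""
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

print(jl_k(10_000, 0.1))  # 7895
```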
Empirical Study: Sketch Distance / Real Distance
[Figure: three histograms of the factor (real distance / sketch distance) for sketch sizes 30, 80, and 1000; the y-axis is the percentage of distances. The factor distribution concentrates around 1.0 as the sketch size grows.]
24
Empirical Comparison: DFT, DWT and Sketch (Stock Return Data)
[Figure: distances over roughly 1,000 data points of stock return data, comparing DFT, DWT, and sketch approximations against the true distance.]
25
Algorithm Overview Using Random Projections/Sketches
- Partition each sketch vector s of size N into groups of some size g.
- The i-th group of each sketch vector s is placed in the i-th grid structure (of dimension g).
- If two sketch vectors s1 and s2 are within distance c·d, where d is the target distance, in more than a fraction f of the groups, then the corresponding windows are candidate highly correlated windows and should be checked exactly. (A minimal sketch of this filter appears below.)
26
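A minimal Python version of this candidate test (our illustration; the real system uses the grid structures to find close groups without comparing all pairs):

```python
import math

def candidate_pair(s1, s2, g, c, d, f):
    """Are sketches s1 and s2 within distance c*d in more than a
    fraction f of their size-g groups? If so, the underlying windows
    are candidates and must be checked exactly."""
    n_groups = len(s1) // g
    hits = sum(1 for i in range(n_groups)
               if math.dist(s1[i*g:(i+1)*g], s2[i*g:(i+1)*g]) <= c * d)
    return hits > f * n_groups

s1 = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
s2 = [0.12, 0.25, 0.30, 0.90, 0.52, 0.65]
print(candidate_pair(s1, s2, g=2, c=0.5, d=0.4, f=0.5))
```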
Optimization in Parameter Space
Next, how do we choose the parameters N, g, c, and f?
- Size of sketch (N): 30, 36, 48, 60
- Group size (g): 1, 2, 3, 4
- Distance multiplier (c): 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3
- Fraction (f): 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
27
Optimization in Parameter Space
- Essentially, we prepare several groups of good parameter candidates and choose the best one to apply to the given data.
- But how do we select the good candidates?
  - Combinatorial design (CD)
  - Bootstrapping
28
Combinatorial Design
- Test the pairwise combinations of all the parameters. Informally: each value of parameter X is combined with each value of parameter Y in at least one experiment, for all X, Y.
- Example: if there are four parameters having respectively 4, 4, 13, and 10 values, exhaustive search requires 4 x 4 x 13 x 10 = 2080 experiments vs. 130 for pairwise combinatorial design. (A greedy construction sketch appears below.)
* http://www.cs.nyu.edu/cs/faculty/shasha/papers/comb.html
29
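Here is a greedy construction of such a pairwise-covering suite in Python (our illustrative sketch, not the construction behind the 130-experiment figure, though greedy methods land in the same ballpark):

```python
import itertools, random

def pairwise_cover(levels, tries=200, seed=0):
    """Greedily build a test suite in which every pair of values of
    every two parameters appears in at least one test."""
    rng = random.Random(seed)
    k = len(levels)
    uncovered = {(i, j, vi, vj)
                 for i, j in itertools.combinations(range(k), 2)
                 for vi in levels[i] for vj in levels[j]}
    tests = []
    while uncovered:
        i, j, vi, vj = next(iter(uncovered))
        best, best_gain = None, -1
        for _ in range(tries):
            cand = [rng.choice(lv) for lv in levels]
            cand[i], cand[j] = vi, vj        # guarantee progress
            gain = sum(1 for (a, b, va, vb) in uncovered
                       if cand[a] == va and cand[b] == vb)
            if gain > best_gain:
                best, best_gain = tuple(cand), gain
        tests.append(best)
        uncovered = {(a, b, va, vb) for (a, b, va, vb) in uncovered
                     if not (best[a] == va and best[b] == vb)}
    return tests

levels = [[30, 36, 48, 60],                   # N
          [1, 2, 3, 4],                       # g
          [x / 10 for x in range(1, 14)],     # c
          [x / 10 for x in range(1, 11)]]     # f
print(len(pairwise_cover(levels)), "tests vs", 4 * 4 * 13 * 10, "exhaustive")
```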
Exploring the Neighborhood around the Best Values
Because combinatorial design is NOT exhaustive, we may not find the optimal combination of parameters at first.
Solution: when good parameter values are found, their local neighbors are searched further for better solutions.
30
How Bootstrapping Is Used
- Goal: test the robustness of a conclusion on a sample data set by creating new samples from the initial sample with replacement.
- Procedure:
  - Start with a sample set of 1,000,000 pairs of time series windows.
  - From it, choose 20,000 sample points with replacement.
  - Compute the recall and precision each time.
  - Repeat many times (e.g., 100 or more).
31
Testing for Stability
- Bootstrap 100 times.
- Compute the mean and standard deviation of the recalls and precisions.
- What we want from good parameters:
  mean(recall) - std(recall) > threshold(recall)
  mean(precision) - std(precision) > threshold(precision)
- If there are no such parameters, enlarge the replacement sample size. (A code sketch follows.)
32
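A minimal Python sketch of the whole stability test (our illustration; `evaluate`, which runs the detector on a resample and returns recall and precision, is a hypothetical stand-in):

```python
import random, statistics

def stable(pairs, evaluate, rounds=100, sample_size=20_000,
           recall_thr=0.95, precision_thr=0.9, seed=0):
    """Resample with replacement, evaluate recall/precision each round,
    and require mean - std to clear each threshold."""
    rng = random.Random(seed)
    recalls, precisions = [], []
    for _ in range(rounds):
        sample = rng.choices(pairs, k=sample_size)   # with replacement
        r, p = evaluate(sample)
        recalls.append(r)
        precisions.append(p)
    return (statistics.mean(recalls) - statistics.stdev(recalls) > recall_thr
            and statistics.mean(precisions) - statistics.stdev(precisions)
                > precision_thr)
```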
[Figure: three time series X, Y, Z are projected, via inner products with random vectors r1, ..., r6, onto sketch vectors (xsk1, ..., xsk6), (ysk1, ..., ysk6), (zsk1, ..., zsk6). Each sketch is partitioned into pairs, e.g. (xsk1, xsk2), (xsk3, xsk4), (xsk5, xsk6), and each pair is placed in its own grid structure.]
[Figure: two bar charts over practical data sets (buoy_sensor, evaporator, foetal_ecg, spot_exrate, steamgen, winding, eeg, cstr, price, return), showing the mean and standard deviation of the ratios DFT distance / real distance and sketch distance / real distance.]
Experiments: Comparison of Processing Time
[Figure: wall clock time in seconds for the sketch, DFT, and linear-scan methods on the practical data sets.]
36
Section 3: Elastic Burst Detection
37
Elastic Burst Detection: Problem Statement
- Problem: given a time series of positive numbers x1, x2, ..., xn, and a threshold function f(w), w = 1, 2, ..., n, find the subsequences of any size whose sums are above the thresholds: all 0 < w < n, 0 < m < n - w, such that x_m + x_{m+1} + ... + x_{m+w-1} ≥ f(w).
- Brute-force search: O(n^2) time.
- Our shifted binary tree (SBT): O(n + k) time, where k is the size of the output, i.e., the number of windows with bursts.
38
Burst Detection: Challenge
- This is a single-stream problem.
- What makes it hard is that we are looking at multiple window sizes at the same time.
- The naive approach is to handle one window size at a time.
39
Astrophysical Application
Motivation: in astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusual burst may signal an event interesting to physicists.
Technical overview:
1. The sky is partitioned into 1800 x 900 buckets.
2. 14 sliding window lengths are monitored, from 0.1 s to 39.81 s.
3. The original code implements the naive window-at-a-time algorithm and can't handle more windows.
40
Bursts across Different Window Sizes in Gamma Rays
Challenge: to discover not only the time of the burst, but also the duration of the burst.
41
Shifted Binary Tree (SBT)
- Define the threshold for a node covering a window of size 2^k to be the threshold for a window of size 1 + 2^(k-1).
42
Burst Detection Using SBT
- Any window of size w, with 2^(i-1) + 2 ≤ w ≤ 2^i + 1, is included in one of the windows at level i + 1.
- For a non-negative data stream and a monotonic aggregation function, if a node at level i + 1 doesn't exceed the threshold for window size 2^(i-1) + 2, none of the windows of sizes between 2^(i-1) + 2 and 2^i + 1 can contain a burst; otherwise a detailed search tests for real bursts. (A minimal code sketch follows.)
- This filters away many windows, reducing CPU time dramatically.
- Shortcoming: the structure is fixed. It can do badly if bursts are very unlikely or relatively likely.
43
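A compact Python rendering of the filter-then-verify idea (our sketch: half-overlapping windows of length 2^level stand in for the tree levels, and we assume f(w) is non-decreasing so a node can be screened with the smallest covered size's threshold):

```python
def detect_bursts(x, f, max_w):
    """Illustrative shifted-binary-tree-style elastic burst detection.
    Assumes positive data and a threshold f(w) non-decreasing in w.
    Returns all (start, w) with w <= max_w and sum(x[start:start+w]) >= f(w)."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    wsum = lambda s, w: prefix[s + w] - prefix[s]   # sum of x[s:s+w]

    # Sizes 1 and 2 are checked directly; level i+1 screens sizes
    # 2^(i-1)+2 .. 2^i+1 with half-overlapping windows of length 2^(i+1).
    bursts = {(s, w) for w in (1, 2) if w <= max_w
              for s in range(n - w + 1) if wsum(s, w) >= f(w)}
    level = 2
    while (1 << (level - 2)) + 2 <= max_w:
        length, shift = 1 << level, 1 << (level - 1)
        lo = (1 << (level - 2)) + 2
        hi = min((1 << (level - 1)) + 1, max_w)
        for s in range(0, n, shift):
            if wsum(s, min(length, n - s)) >= f(lo):   # alarm: detailed search
                for w in range(lo, hi + 1):
                    for m in range(s, min(s + length, n) - w + 1):
                        if wsum(m, w) >= f(w):
                            bursts.add((m, w))
        level += 1
    return bursts

# Example: counts with a spike; the threshold grows with the window size.
data = [1, 0, 2, 1, 9, 8, 7, 1, 0, 1, 2, 1]
print(sorted(detect_bursts(data, lambda w: 6 + 3 * w, max_w=8)))
```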
Shifted Aggregation Tree
- A hierarchical tree structure in which each node is an aggregate.
- It differs from the SBT in two ways:
  - Parent-child structure: defines the topological relationship between a node and its children.
  - Shifting pattern: defines how many timepoints apart two neighboring nodes at the same level are.
44
Aggregation Pyramid (AP)
- An N-level isosceles-triangle-shaped data structure built on a sliding window of length N.
- Level 0 has a one-to-one correspondence with the input time series.
- Level h stores the aggregates of h+1 consecutive elements, i.e., a sliding window of length h+1.
- The AP stores every aggregate for every window size starting at every timepoint. (A minimal construction sketch follows.)
45
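A minimal construction in Python (our sketch; `agg` defaults to sum, a monotonic aggregate as required above):

```python
def aggregation_pyramid(x, agg=sum):
    """pyramid[h][t] aggregates the h+1 elements x[t:t+h+1], so every
    window size at every starting timepoint is represented."""
    n = len(x)
    pyramid = [list(x)]                  # level 0: the input itself
    for h in range(1, n):
        prev = pyramid[h - 1]
        # Extend each level-(h-1) cell by one more raw element.
        pyramid.append([agg([prev[t], x[t + h]]) for t in range(n - h)])
    return pyramid

ap = aggregation_pyramid([3, 1, 4, 1, 5])
print(ap[2])   # sums over windows of length 3: [8, 6, 10]
```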
Aggregation Pyramid Properties
- 45° diagonal: cells with the same starting time.
- 135° diagonal: cells with the same ending time.
- Shadow of cell(t, h): the sliding window starting at time t and ending at t + h - 1.
- Coverage of cell(t, h): all the cells in the sub-pyramid rooted at cell(t, h).
- Overlap of cell(t1, h1) and cell(t2, h2): the cell at the intersection of the 135° diagonal touching cell(t1, h1) and the 45° diagonal touching cell(t2, h2).
46
Embedding the Shifted Binary Tree in an Aggregation Pyramid
47
Aggregation Pyramid as a Host Data Structure
- Many structures besides the Shifted Binary Tree can be embedded in an Aggregation Pyramid.
- The update-filter-search framework guarantees detection of all the bursts as long as the structure includes the level-0 cells and the top-level cell.
- What kinds of structures are good for burst detection?
48
Which Shifted Aggregation Tree Should Be Used?
- Many Shifted Aggregation Trees are available, and all of them guarantee detection of all the bursts. Which structure should be used?
- Intuitively, the denser a structure, the more updating time and the less detailed-search time, and vice versa.
- We want the structure that minimizes the total CPU running time, given the input.
49
State-Space Algorithm
- View a Shifted Aggregation Tree (SAT) as a state.
- View the growth from one SAT to another as a transformation between states.
50
State-Space Algorithm
- Initial state: a Shifted Aggregation Tree (SAT) containing only level 0.
- Transformation rule: if by adding one level to the top of SAT B we get SAT A, then state B can be transformed to state A.
- Final state: a SAT which can cover the maximum window size of interest.
- Traversal strategy: best-first search.
  - Associate each state with a cost.
  - Prune, in order to explore more reasonable structures.
51
Results
- The Shifted Aggregation Tree outperforms the Shifted Binary Tree: a factor of 35 speedup in some experiments.
- The Shifted Aggregation Tree can adjust its structure to adapt to different inputs.
[Figures: CPU time (ms) vs. λ for Poisson data (λ from 0.001 to 1000), comparing SAT, SBT, and the naive algorithm; and CPU time (ms) vs. threshold (burst probability p = 10^-k, k = 2..10), comparing SAT and SBT.]
52
Greedy Dynamic Burst Detection
- Real-world data keeps changing.
- A trained structure does poorly if the training data differs significantly from the data to be detected: 10%-250% more detection time, as shown in the figure below.
[Figure: CPU time (ms) of the static detector when trained on sets with different λ (0.8 to 1.2).]
53
Ideas
- Basic idea: change a structure if the change helps to reduce the overall cost.
- Be greedy when making a structure denser:
  - If the saved detailed-search cost is greater than the updating/filtering cost, add the level.
- Use delayed greed when making a structure sparser:
  - Alarms tend to occur in clusters, across multiple sizes and multiple neighboring windows.
  - A lower level may support a higher level.
  - Don't remove a level if an alarm occurred recently.
54
Algorithm Sketch
- Start with a structure trained using the state-space algorithm.
- If an alarm is raised at level k:
  - If adding a level between level k and level k-1 can save some cost, add the level.
  - If the level can't be added because a lower level needed to support it is not in the structure, add that lower level.
  - If the level can't be added because the shift doesn't satisfy the Shifted Aggregation Tree property, legally narrow the shifts.
- Else, if the aggregate at level k doesn't exceed the minimum threshold for level k-1 and no alarm occurred recently:
  - If it is legal to remove level k-1, remove it;
  - else legally double the shift.
55
Results
- Different training sets: the dynamic algorithm overcomes the discrepancy resulting from biased training data.
- Different sets of window sizes: when the number of window sizes is small, the dynamic algorithm performs slightly less well than the static algorithm.
[Figures: CPU time (ms) of static vs. dynamic detection, for training sets with different λ (0.8 to 1.2, Poisson) and for different numbers of window sizes (10 to 100, Poisson).]
56
Volume Spike Detection in Stock Trading
- Trading volume indicates buying/selling interest, the underlying force for price movements.
- Volume spike: a burst of trading activity, a signal in program trading.
- High rate: more than 100 updates per second per stock.
[Figure: a price/volume chart showing a volume spike, from marketvolume.com.]
57
Volume Spike Detection in Stock Trading
- Setup
  - Jan. 2001 - May 2004 tick-by-tick trading activities of the IBM stock
  - 23 million timepoints
  - Exponential distribution
- Results
  - Real-time response: 0.01 ms per timepoint on average
  - 3-5 times speedup over the Shifted Binary Tree
  - The dynamic algorithm performs slightly better than the static algorithm.
[Figures: CPU time (ms) for SBT, static SAT, and dynamic SAT on the IBM data: vs. threshold (burst probability p = 10^-k, k = 2..10), vs. maximum window size of interest (10 to 1800), and vs. number of window sizes (10 to 240).]
58
Click Fraud Detection in Website Traffic Monitoring
- Setup
  - 2003 Sloan Digital Sky Survey web traffic log, the same type of data as click data
  - Number of requests within each second
  - 31 million timepoints
  - Poisson distribution
- Results
  - 2-5 times speedup over the Shifted Binary Tree
  - The dynamic algorithm performs better than the static algorithm.
[Figures: CPU time (ms) for SBT, static SAT, and dynamic SAT on the SDSS data: vs. threshold (burst probability p = 10^-k, k = 2..10), vs. maximum window size of interest (10 to 1800), and vs. number of window sizes (10 to 240).]
59
Fast and Accurate
Time Series Matching with Time-Warping
60
Outline
- Problem Statement
- Related Work Review
- Case Study: Query by Humming
- Future Work
61
Goal of This Work
- Goal: build fast and accurate similarity search algorithms for large-scale time series systems that allow complex time shifting in the query.
- Two major challenges:
  - Query ambiguity
  - The large size of the database
63
Related Work Review
- GEMINI framework
  - Introduced by C. Faloutsos, M. Ranganathan, and Y. Manolopoulos to avoid linear-scan comparison.
- Dynamic Time Warping (DTW)
  - Introduced by D. Berndt and J. Clifford to allow time shifting in time series comparison.
- Envelope and Envelope Transforms
  - Introduced by E. Keogh to index the DTW distance.
  - Generalized into the GEMINI framework by our group.
64
Dynamic Time Warping
The DTW distance between two time series x and y is computed by optimal path finding:
- Each path from (1,1) to (m,n) is an alignment.
- (i,j) represents aligning x(i) with y(j).
- cost(i,j) equals |x(i) - y(j)|.
- The optimal path has the minimum total cost. (A code sketch follows.)
[Figure: two time series and the alignment grid between them.]
65
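A straightforward Python implementation of this recurrence (our sketch; practical systems also constrain the warping width to keep the alignment local):

```python
def dtw(x, y):
    """Classic O(m*n) dynamic-time-warping distance with cost |x[i]-y[j]|.
    dp[i][j] = minimum total cost of aligning x[:i+1] with y[:j+1]."""
    m, n = len(x), len(y)
    INF = float("inf")
    dp = [[INF] * n for _ in range(m)]
    dp[0][0] = abs(x[0] - y[0])
    for i in range(m):
        for j in range(n):
            if i == j == 0:
                continue
            best_prev = min(dp[i-1][j] if i else INF,        # repeat y[j]
                            dp[i][j-1] if j else INF,        # repeat x[i]
                            dp[i-1][j-1] if i and j else INF)
            dp[i][j] = abs(x[i] - y[j]) + best_prev
    return dp[m-1][n-1]

# Time-shifted copies are close under DTW though far under Euclidean distance.
print(dtw([0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0]))   # 0.0
```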
Problem of DTW Distance
The DTW distance does not obey the triangle inequality, which most standard indexing methods require.
66
Envelope and Envelope Transform
Envelopes filter out bad candidates at lower computational cost while guaranteeing no false negatives.
- Envelope filter (in the original space)
- Transformed envelope filter (in the feature space)
67
Example of Envelope Transform
- Piecewise Aggregate Approximation (PAA)
[Figure: an original time series with its upper and lower envelopes, and the transformed envelopes U_new and L_new after PAA. A code sketch follows.]
68
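A minimal sketch of the two steps in Python (our illustration; `r`, the allowed warping radius, and the segment size are assumed parameters):

```python
def envelope(q, r):
    """Upper/lower envelope of query q for warping radius r:
    U[i] = max(q[i-r..i+r]), L[i] = min(q[i-r..i+r])."""
    n = len(q)
    U = [max(q[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    L = [min(q[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]
    return U, L

def paa(series, seg):
    """Piecewise Aggregate Approximation: the mean of each length-seg piece."""
    return [sum(series[i:i + seg]) / len(series[i:i + seg])
            for i in range(0, len(series), seg)]

q = [0, 0, 1, 2, 1, 0, 0, 0]
U, L = envelope(q, r=1)
U_new, L_new = paa(U, 2), paa(L, 2)   # transformed envelopes for indexing
```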
Case Study: Query by Humming
[Diagram: a humming query is sent to the Music Database System, which returns similar music segments.]
- Application-Specific Challenges
- Related Work
- Proposed Framework
- Experiment
69
Challenges
- People may hum out of tune:
  - Different base key
  - Inaccurate relative pitch
  - Unstable pitch
  - Different tempo
  - Varying tempo
- It is hard to segment notes in the humming.
70
Flourishing Literature
- String-represented note-sequence matching
  [A. Uitdenbogerd et al. Matching techniques for large music databases. ACM Multimedia 99]
- Data reduction transformations (STFT)
  [N. Kosugi et al. A practical query-by-humming system for a large music database. ACM Multimedia 2000]
- Melody slope matching
  [Yongwei Zhu et al. Pitch tracking and melody slope matching for song retrieval. Advances in Multimedia Information Processing, PCM 2001]
- Dynamic Time Warping (DTW) on pitch contour
  [Y. Zhu and D. Shasha. Warping indexes with envelope transforms for query by humming. ACM SIGMOD 2003]
- String editing on note sequences combined with rhythm alignment
  [W. A. Arentz. Methods for retrieving musical information based on rhythm and pitch correlation. CSGSC 2003]
71
Problems with Related Work
- It is difficult to do performance comparisons:
  - No standard for evaluation
  - Data sets and test sets differ across papers
  - Definitions of accuracy are not reliable
- General conclusions are nevertheless possible:
  - Warped distance measures are more robust.
  - Warped distance matching needs to scale up.
72
Experiment on Scaling Up
Top-K hit rate = (# recognized in Top-K list) / (# recognized by humans)
- Test set
  - 41 hummed tunes from Beatles songs
  - Collected from both amateurs and professionals
  - Recognizable by more than three persons
- Two data sets
  A. 53 Beatles songs (included in B)
  B. 1,032 songs, including 123 Beatles songs, 908 American rock-and-pop songs, and one piece of Chinese game music
* Both systems are optimized using another test set, which includes 17 hummings.
73
Framework Proposal
[Diagram: the query is hummed with 'ta' and segmented into a note/duration sequence; the query criteria combine rhythm (duration) and melody (note). The candidate set is narrowed in stages: a keyword filter against a keyword database; a boosted filter built from statistics-based features; a Nearest-N search on DTW distance with the transformed envelope filter, yielding the top N' matches; and finally an alignment verifier that produces the top N matches.]
74
Important Heuristic: 'ta'-Based Humming*
- Solves the problem of note segmentation in most cases.
- Compare humming with 'la' against humming with 'ta'.
* Idea from N. Kosugi et al. "A practical query-by-humming system for a large music database". ACM Multimedia 2000
75
Benefits of 'ta'-Style Humming
- Decreases the size of the time series by orders of magnitude,
- thus reducing the computation of the DTW distance.
76
Important Heuristic: Statistics-Based Filters*
- Low-dimensional statistical features
  - Lower computational cost than the DTW distance
  - Quickly filter out true negatives
- Example: filter out candidates whose note length is much larger/smaller than the query's note length (a code sketch follows).
- More features:
  - Standard deviation of note values
  - Zero-crossing rate of note values
  - Number of local minima/maxima of note values
  - Histogram of note values
* Intuition from Erling Wold et al. "Content-based classification, search and retrieval of audio". IEEE Multimedia 1996. http://www.musclefish.com
77
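A minimal Python sketch of such features and a length filter (our illustration; the feature set mirrors the list above, but the 2x length-ratio cutoff is an assumed value, not the paper's tuned setting):

```python
import statistics

def note_features(notes):
    """Cheap statistics-based features of a note-value sequence,
    used to discard candidates far from the query before any DTW."""
    mean = statistics.fmean(notes)
    # Zero crossings counted about the mean of the note values.
    zero_crossings = sum(1 for a, b in zip(notes, notes[1:])
                         if (a - mean) * (b - mean) < 0)
    local_extrema = sum(1 for a, b, c in zip(notes, notes[1:], notes[2:])
                        if (b - a) * (c - b) < 0)
    return {"length": len(notes),
            "std": statistics.pstdev(notes),
            "zero_crossing_rate": zero_crossings / max(len(notes) - 1, 1),
            "local_extrema": local_extrema}

def passes_length_filter(query, candidate, ratio=2.0):
    """Reject candidates whose note length differs wildly from the query's."""
    nq, nc = len(query), len(candidate)
    return nq / ratio <= nc <= nq * ratio
```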
Important Heuristic: Boosting
- Characteristics of statistics-based filters:
  - Quick, but do not guarantee no false negatives
  - Become weak classifiers under bad parameter settings
  - Ideal candidates for boosting
- Boosting*:
  - "An algorithm for constructing a 'strong' classifier using only a training set and a set of 'weak' classification algorithms."
  - "A particular linear combination of these weak classifiers is used as the final classifier, which has a much smaller probability of misclassification."
* Cynthia Rudin et al. "On the Dynamics of Boosting". In Advances in Neural Information Processing Systems 2004
78
Important Heuristic: Alignment Verification
- The Nearest-N search uses only melody information, which does not guarantee that the rhythm will match.
- Will A. Arentz et al. suggest combining rhythm and melody information in the similarity comparison:
  - Results are generally better than using only melody information.
  - Not appropriate when the sum of several notes' durations in the query may correspond to the duration of one note in the candidate.
- Our method:
  1. Use melody information for the DTW distance computation.
  2. Reject candidates that have bad local note alignment.
  3. Merge durations appropriately based on the note alignment.
  4. Reject candidates that have bad duration alignment.
79
Experiments
- Data set
  - 1,032 songs were divided and organized into 73,051 melodic segments.
- Computing environment
  - Pentium IV 2.4 GHz
  - 768 MB memory: all the data can be loaded
  - K: an array-processing language
- Query criteria
  - Return the Top-15 list of segments with DTW distance less than 0.5.
  - Allow 0.05 * N local time shifting for a query with N notes.
80
Experiment One: Human Humming
Query set: 109 'ta'-style hummings
- The previous 41 hummings for Beatles songs
- 65 hummings for American rock-and-pop songs
- 3 hummings for the Chinese game music
- Recognizable by at least two persons
- Number of notes varies from 6 to 24
81
Future Work
- Model and estimate the error more accurately:
  - Analyze the relationship between the algorithm's performance and observed humming errors.
  - Build a standard benchmark to evaluate and compare different QbH systems.
- Investigate more lower-bounding filters at lower levels.
- Investigate more classifiers to boost.
- Build an intelligent system:
  - Self-improving, by adjusting to a particular user's humming patterns.
82
Themes and Approaches
- Our approach: take simple problems and make them fast.
- Correlation and related primitives (e.g., matching pursuit) are simple, but we want to do them for many thousands of time series, incrementally and in near-linear time.
- Burst detection over many window sizes in near-linear time.
- Query by humming: large-scale and accurate.
- Coming to a store near you?
83