Stabbing the Sky: Efficient Skyline Computation over Sliding Windows
COMP9314 Lecture Notes
Outline
• Introduction
• n-of-N Queries
• (n1, n2)-of-N Queries
• Performance Evaluation
• Conclusions
COMP9314
Xuemin [email protected]
Skyline
Skyline Query:
• Input: a set of points in d-dimensional space.
• Output: the points not dominated by any other point.
• (x1, x2, …, xd) dominates (y1, y2, …, yd) iff xi <= yi for all 1 <= i <= d and ∃k such that xk < yk.
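The definition above can be sketched directly in code. This is a minimal quadratic reference version, not one of the algorithms discussed in these notes; the sample points are hypothetical and smaller values are treated as better, following the dominance definition.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    """x dominates y iff x_i <= y_i in every dimension and x_k < y_k in at least one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Keep every point that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical 2-d points.
deals = [(2, 6), (5, 2), (6, 5), (8, 3), (3, 7)]
print(skyline(deals))  # [(2, 6), (5, 2)]
```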
Applications
Multi-criteria decision making…
Stock Trading Example:
• What are the top deals?
Skyline Query Over Sliding Window
Stock Trading Example
• Top deals of a stock in the last 5 mins? last 4 mins, …
• Top deals of a stock in the last 10K deals? …
Queries:
• n-of-N model (n <= N): the most recent n elements
• (n1, n2)-of-N model (n1 <= n2 <= N): the elements between the n2-th and the n1-th most recent
• One-time queries
• Continuous queries
Challenges
Insertions & deletions (possibly high speed).
On-line information
– memory requirement
– processing speed
Existing techniques do not support n-of-N queries:
[Borzsonyi et al. (ICDE01), Tan et al. (VLDB01), Kossmann et al. (VLDB 02), Papadias et al. (SIGMOD03), Kapoor (SIAM J. comp00)]
– they support computation over the whole dataset only
– O(n log^{d-2} n) for d >= 4 and O(n log n) otherwise
Results
n-of-N:
• keep N' (N' <= N) elements, where N' = O(log^d N) if the data distribution on each dimension is independent.
• a novel encoding scheme, with O(N') space, leads to n-of-N query time O(log N' + s) instead of O(n log^{d-2} n).
• a new trigger-based technique for continuously processing an n-of-N query.
– trigger update time: O(log s).
– result update time: O(log δ), where δ is the change in the result.
(n1, n2)-of-N: similar results.
n-of-N Queries
• An element e is redundant w.r.t. P_N (the most recent N elements) iff
– e has expired from P_N, or
– ∃e' such that e' dominates e and e' is younger than e
[Figure: example with N = 6]
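A brute-force sketch of this pruning rule, assuming 1-based arrival positions as timestamps. The coordinates are hypothetical (the figures are not reproduced in these notes) and were chosen so that, with M = 7 and N = 7, the surviving timestamps match the set {3, 4, 5, 6, 7} used in the later examples.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def non_redundant(stream: Sequence[Point], N: int) -> Dict[int, Point]:
    """Return R_N as {timestamp: point} for the most recent N elements of the stream."""
    M = len(stream)
    # P_N: elements that have not expired.
    window = {t: stream[t - 1] for t in range(max(1, M - N + 1), M + 1)}
    # Drop every element dominated by a younger element of the window.
    return {
        t: p
        for t, p in window.items()
        if not any(u > t and dominates(q, p) for u, q in window.items())
    }


# Hypothetical stream; element i has timestamp i.
STREAM = [(4, 8), (7, 7), (2, 6), (5, 2), (6, 5), (8, 3), (3, 7)]
print(sorted(non_redundant(STREAM, N=7)))  # [3, 4, 5, 6, 7]
```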
Optimality
Theorem: Non-redundant points (R_N) vs. n-of-N skyline query result (Q_{n,N})
– No element of P_N − R_N appears in any Q_{n,N}
– Q_{n,N} must be a subset of R_N
– ∀x ∈ R_N, ∃n such that x ∈ Q_{n,N}
– |R_N| = O(log^{d-1} N) for "independent" distributions
• Only R_N needs to be kept: it is the minimum set of elements that must be kept.
Querying R_N
• critical dominance: e →c e' (e critically dominates e') if e is the youngest element of R_N that dominates e'.
• dominance graph G_RN: R_N together with the critical dominance relationships.
[Figure: two panels plotting points 1 to 7 in the X-Y plane, M = 7, N = 7]
Querying R_N
e ∈ Q_{n,N} iff
• e is a root in G_RN (and t(e) >= M-n+1), or
• e' →c e in G_RN and e' has expired w.r.t. the most recent n elements, i.e. t(e') < M-n+1 <= t(e)
n      R_N (within the last n)   M-n+1   Q_{n,N}
n=5    {3,4,5,6,7}               3       {3,4}
n=4    {4,5,6,7}                 4       {4,7}
n=3    {5,6,7}                   5       {5,6,7}
[Figure: points 1 to 7 in the X-Y plane, M = 7, N = 7]
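The membership test above can be sketched as follows. The coordinates of R_N are hypothetical, chosen so that the critical dominance edges (3 →c 7, 4 →c 5, 4 →c 6) and the three query results agree with the example with M = 7, N = 7. The function names are illustrative, and linear scans are used instead of any index.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def critical_parents(rn: Dict[int, Point]) -> Dict[int, int]:
    """Map each timestamp in R_N to the timestamp of its critical dominator (0 for roots)."""
    return {
        t: max((u for u, q in rn.items() if dominates(q, p)), default=0)
        for t, p in rn.items()
    }


def n_of_N(rn: Dict[int, Point], M: int, n: int) -> List[int]:
    """e is in Q_{n,N} iff t(parent) < M-n+1 <= t(e); roots use parent timestamp 0."""
    lo = M - n + 1
    parent = critical_parents(rn)
    return sorted(t for t in rn if parent[t] < lo <= t)


# Hypothetical coordinates for R_N with M = 7, N = 7.
RN = {3: (2, 6), 4: (5, 2), 5: (6, 5), 6: (8, 3), 7: (3, 7)}
print(critical_parents(RN))  # {3: 0, 4: 0, 5: 4, 6: 4, 7: 3}
for n in (5, 4, 3):
    print(n, n_of_N(RN, M=7, n=n))
```

Under these assumed coordinates the loop prints [3, 4], [4, 7] and [5, 6, 7] for n = 5, 4, 3.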
Querying R_N: Optimal Algorithm
To answer an n-of-N query, encode G_RN using intervals:
• Each edge e' →c e becomes the interval (t(e'), t(e)]; a root e becomes (0, t(e)].
• Stab the intervals with M-n+1.
• For every returned interval (x, y], return the point whose timestamp is y.
• Technique: use an interval tree index to achieve the optimal O(log |R_N| + s) query time.
[Figure: M = 7, N = 7; G_RN encoded as the intervals (0,3], (0,4], (3,7], (4,5], (4,6]; e.g., n = 6]
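A stabbing query over the interval encoding could look like the sketch below. It uses a static centered interval tree built once from the five intervals of the example. The class and function names are illustrative, and the dynamic tree required for maintenance is not attempted here.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (a, t) encodes the half-open interval (a, t]


class Node:
    """Static centered interval tree over half-open intervals (a, t]."""

    def __init__(self, intervals: List[Interval]) -> None:
        highs = sorted(t for _, t in intervals)
        self.center = highs[len(highs) // 2]  # the median right endpoint
        here = [iv for iv in intervals if iv[0] < self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] >= self.center]
        self.by_low = sorted(here)  # ascending left endpoint
        self.by_high = sorted(here, key=lambda iv: iv[1], reverse=True)
        self.left: Optional[Node] = Node(left) if left else None
        self.right: Optional[Node] = Node(right) if right else None

    def stab(self, q: int, out: List[int]) -> None:
        """Append t for every stored interval (a, t] with a < q <= t."""
        if q <= self.center:
            for a, t in self.by_low:  # every interval here has t >= center >= q
                if a >= q:
                    break
                out.append(t)
            if q < self.center and self.left:
                self.left.stab(q, out)
        else:
            for a, t in self.by_high:  # every interval here has a < center < q
                if t < q:
                    break
                out.append(t)
            if self.right:
                self.right.stab(q, out)


def n_of_N(tree: Node, M: int, n: int) -> List[int]:
    out: List[int] = []
    tree.stab(M - n + 1, out)
    return sorted(out)


INTERVALS = [(0, 3), (0, 4), (3, 7), (4, 5), (4, 6)]  # from the example, M = 7, N = 7
tree = Node(INTERVALS)
print(n_of_N(tree, M=7, n=6))  # [3, 4]
print(n_of_N(tree, M=7, n=4))  # [4, 7]
```

Each node scans only the intervals it reports plus one, which is where the output-sensitive s term comes from; the logarithmic term is the depth of the tree.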
Maintaining R_N
When a new element e_new arrives:
• If the oldest element e_old ∈ R_N expires, remove e_old and update R_N and G_RN (interval tree).
• Find the set D ⊆ R_N of elements dominated by e_new; update R_N and G_RN.
– Depth-first search on an R-tree of R_N
• Find e such that e →c e_new; update G_RN.
– Best-first search on the R-tree of R_N
[Figure: M = 8, N = 5, before and after the arrival of e_new = 8; e_old = 3, D = {6}, e = 4]
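The three maintenance steps can be sketched as below. Linear scans stand in for the R-tree searches and a plain dictionary of critical parents stands in for the interval tree, so this shows the logic only, not the stated costs. The class name and the coordinates are hypothetical; the coordinates were picked so that the eighth arrival reproduces e_old = 3, D = {6} and e = 4.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


class NonRedundantSet:
    """R_N plus the critical dominator of each element (0 marks a root)."""

    def __init__(self, N: int) -> None:
        self.N, self.M = N, 0
        self.points: Dict[int, Point] = {}
        self.parent: Dict[int, int] = {}

    def insert(self, p: Point) -> List[int]:
        """Process one arrival and return D, the timestamps it dominates."""
        self.M += 1
        expired = self.M - self.N  # timestamp that just left the window
        if expired in self.points:
            del self.points[expired], self.parent[expired]
            for t, a in list(self.parent.items()):
                if a == expired:  # elements it critically dominated become roots
                    self.parent[t] = 0
        D = [t for t, q in self.points.items() if dominates(p, q)]
        for t in D:
            del self.points[t], self.parent[t]
        # Critical dominator: the youngest remaining element that dominates p.
        older = [t for t, q in self.points.items() if dominates(q, p)]
        self.points[self.M] = p
        self.parent[self.M] = max(older, default=0)
        return D


# Hypothetical stream; element i has timestamp i.
STREAM = [(4, 8), (7, 7), (2, 6), (5, 2), (6, 5), (8, 3), (3, 7), (7, 3)]
rn = NonRedundantSet(N=5)
for point in STREAM:
    removed = rn.insert(point)
print(removed)    # [6]
print(rn.parent)  # {4: 0, 5: 4, 7: 0, 8: 4}
```

Elements critically dominated by a member of D need no separate repair: e_new dominates them transitively, so they are in D as well.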
Continuous n-of-N Query
Trigger-based algorithm:
• Deletion: remove e_old and the elements of D from Q_{n,N}.
• Insertion: add e_new to Q_{n,N} unless some e' with e' →c e_new has t(e') >= M-n+1.
• Maintain a min-heap of Q_{n,N} for efficiency.
[Figure: M = 8, N = 5; e_new = 8, e_old = 3, D = {6}, e' = 4. With n = 4: Q_{4,5} = {4,7} at M = 7 and {5,7,8} at M = 8. With n = 5: Q_{5,5} = {3,4} at M = 7 and {4,7} at M = 8.]
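A simplified sketch of the continuous case follows. It keeps the result as a set and finds the element leaving the n-window by direct timestamp lookup; a linear scan over the critical parents replaces both the min-heap and the trigger index, so the update costs quoted earlier do not apply. The class name and coordinates are hypothetical, chosen to reproduce the Q_{4,5} and Q_{5,5} values of the example.

```python
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


class ContinuousNofN:
    """Keeps Q_{n,N} up to date as elements arrive (timestamps are 1, 2, 3, ...)."""

    def __init__(self, n: int, N: int) -> None:
        self.n, self.N, self.M = n, N, 0
        self.points: Dict[int, Point] = {}  # R_N
        self.parent: Dict[int, int] = {}    # critical dominator timestamp, 0 for roots
        self.result: Set[int] = set()       # current Q_{n,N}

    def _drop(self, t: int) -> None:
        self.points.pop(t, None)
        self.parent.pop(t, None)
        self.result.discard(t)

    def insert(self, p: Point) -> List[int]:
        self.M += 1
        lo = max(1, self.M - self.n + 1)  # oldest timestamp inside the n-window
        leaving = self.M - self.n         # timestamp that just left the n-window
        self._drop(self.M - self.N)       # expiry from the N-window
        for t in [t for t, q in self.points.items() if dominates(p, q)]:
            self._drop(t)                 # D: elements dominated by the new arrival
        if leaving >= 1:
            self.result.discard(leaving)
            # Trigger: elements whose critical dominator just left become results.
            self.result.update(t for t, a in self.parent.items() if a == leaving)
        a_new = max((t for t, q in self.points.items() if dominates(q, p)), default=0)
        self.points[self.M] = p
        self.parent[self.M] = a_new
        if a_new < lo:                    # no dominator inside the n-window
            self.result.add(self.M)
        return sorted(self.result)


# Hypothetical stream; element i has timestamp i.
STREAM = [(4, 8), (7, 7), (2, 6), (5, 2), (6, 5), (8, 3), (3, 7), (7, 3)]
query = ContinuousNofN(n=4, N=5)
history = [query.insert(point) for point in STREAM]
print(history[6], history[7])  # [4, 7] [5, 7, 8]
```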
(n1,n2)-of-N Query
More complicated than n-of-N Query
• P_N needs to be kept!
• (Old) critical dominance: t(a_e) = max { t(e') : e' dominates e and t(e') < t(e) }
• Backward critical dominance: t(b_e) = min { t(e') : e' dominates e and t(e') > t(e) }
• e ∈ Q_{(n1,n2),N} iff t(a_e) < M-n2+1 <= t(e) <= M-n1+1 < t(b_e)
• CBC dominance graph: P_N and the two kinds of dominance relationships
[Figure: M = 7, N = 7; points 1 to 7 in the X-Y plane. Example: a_5 = 3, b_5 = 6; (M-n2+1, M-n1+1] = (4,6]; (2,4)-of-7 = {4, 6}]
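The two critical dominators and the membership test can be sketched by brute force. The coordinates are hypothetical, picked so that a_5 = 3, b_5 = 6 and the (2,4)-of-7 answer is {4, 6} as in the example. Timestamps 0 and infinity stand for a missing older or younger dominator.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def cbc(pn: Dict[int, Point]) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Return (a, b): youngest older dominator and oldest younger dominator per element."""
    a: Dict[int, float] = {}
    b: Dict[int, float] = {}
    for t, p in pn.items():
        doms = [u for u, q in pn.items() if dominates(q, p)]
        a[t] = max((u for u in doms if u < t), default=0)
        b[t] = min((u for u in doms if u > t), default=math.inf)
    return a, b


def n1_n2_of_N(pn: Dict[int, Point], M: int, n1: int, n2: int) -> List[int]:
    """e qualifies iff a_e < M-n2+1 <= t(e) <= M-n1+1 < b_e."""
    a, b = cbc(pn)
    lo, hi = M - n2 + 1, M - n1 + 1
    return sorted(t for t in pn if a[t] < lo <= t <= hi < b[t])


# Hypothetical coordinates for P_N with M = 7, N = 7.
PN = {1: (4, 2), 2: (9, 9), 3: (2, 5), 4: (3, 8), 5: (6, 6), 6: (5, 3), 7: (7, 4)}
a, b = cbc(PN)
print(a[5], b[5])                       # 3 6
print(n1_n2_of_N(PN, M=7, n1=2, n2=4))  # [4, 6]
```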
Processing (n1,n2)-of-N Query
Encode the CBC dominance graph:
– e → ((a_e, e], b_e)
– build an interval tree on the intervals (a_e, e] only
Stab the interval tree with M-n2+1 and check e <= M-n1+1 < b_e on the fly:
– O(log N + s*), sub-optimal
[Figure: M = 7, N = 7; intervals (1,6], (3,4], (3,5], ...; for (2,4)-of-7, (M-n2+1, M-n1+1] = (4,6]; candidates {4, 5, 6}; result {4, 6}]
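The stab-then-filter step can be illustrated on a precomputed encoding. The (a_e, b_e) values below are hypothetical except where the example fixes them (the intervals (1,6], (3,4], (3,5] and b_5 = 6), and a linear scan stands in for the interval tree.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical encoding e -> (a_e, b_e); the stored interval is (a_e, e].
ENCODING: Dict[int, Tuple[float, float]] = {
    1: (0, math.inf),
    2: (1, 3),
    3: (0, math.inf),
    4: (3, math.inf),
    5: (3, 6),
    6: (1, math.inf),
    7: (6, math.inf),
}


def stab(encoding: Dict[int, Tuple[float, float]], q: int) -> List[int]:
    """Timestamps e whose interval (a_e, e] contains q (scan in place of the tree)."""
    return sorted(e for e, (a, _) in encoding.items() if a < q <= e)


def answer(encoding: Dict[int, Tuple[float, float]], M: int, n1: int, n2: int):
    lo, hi = M - n2 + 1, M - n1 + 1
    candidates = stab(encoding, lo)
    # On-the-fly check: e <= M-n1+1 < b_e.
    result = [e for e in candidates if e <= hi < encoding[e][1]]
    return candidates, result


print(answer(ENCODING, M=7, n1=2, n2=4))  # ([4, 5, 6], [4, 6])
```

Element 5 is stabbed but then rejected because b_5 = 6 falls inside the queried range, which shows why the number of candidates can exceed the size of the final result.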
More on (n1,n2)-of-N Query
Maintenance: similar to that of n-of-N queries, but
– always expire the oldest element in P_N, and maintain the interval tree and the R-tree on R_N.
– implementation-wise: use two interval trees to index R_N and P_N − R_N, respectively.
Continuous queries
– More complicated
• A new skyline point may be neither a skyline point in the previous result,
• nor critically dominated by a skyline point in the previous result,
• nor a newly arrived point
– Basic idea
• Maintain additional Candidate Solutions (minimization) & triggers
– Details in the full paper
Experiment Setup
• Hardware
– P4 2.8G CPU, 1G Memory
• Datasets
– Correlated, independent, and anticorrelated
– d = 2 to 5, N = 10^6
• Algorithms
– KLP, nN, mnN, cnN, n12N, mn12N
• Metrics
– Processing time → streaming rate
n-of-N Query
• Varying dimensionality
– M up to 2M, N = 1M, n drawn uniformly from [1K, 1M], #queries = 1000
n-of-N Query (cont’d)
• Varying n
• for correlated, independent, and anti-correlated datasets
Maintenance Costs
2d and 5d datasets; measure average and max time; N = i × 10^5
Scalability
M (total number) = 2M, N = 1M, #queries = 2M
[Figures: independent and anti-correlated datasets]
Continuous n-of-N Queries
• 2d & 5d datasets
• N = 10K and 1M
• 10 queries with n = i*(N/10)
• measures cnN avg, cnN max, nN avg, nN max
(n1,n2)-of-N Queries
Varying dimensionality
– M up to 2M, N = 1M, #queries = 1000
– restricting n2 – n1 >= 500
Scalability
– M = 2M, N = 1M, #queries = 2M
Maintenance
• 2d and 5d datasets
• measure average and max time
• N = i × 10^5
Conclusions
• Efficient algorithms for several kinds of sliding-window skyline queries
– Keep only the minimum number of points
– Encode and index those points
– Maintain all the data structures
• The proposed solutions
– come with theoretical performance guarantees, and
– have demonstrated efficiency and scalability in the experiments
• Future work
– Improve the current solution for (n1,n2)-of-N queries
– Approximate skyline queries
Q&A
Thank You!
Reference
• [ICDE01] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. ICDE, 2001.
• [VLDB01] K. Tan, P. Eng, and B. Ooi. Efficient progressive skyline computation. VLDB, 2001.
• [VLDB 02] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. VLDB, 2002.
• [SIGMOD03] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal progressive algorithm for skyline queries. SIGMOD, 2003.
• [SIAM J. comp00] S. Kapoor. Dynamic maintenance of maxima of 2-d point sets. SIAM J. Comput., 2000.