Transcript Sliding

Approximation and Load Shedding
for QoS in DSMS*
CS240B Notes
By
Carlo Zaniolo
CSD--UCLA
________________________________________
* Notes based on a VLDB’02 tutorial by Minos
Garofalakis, Johannes Gehrke, and Rajeev Rastogi
1
Synopses and Approximation
 Synopsis: bounded-memory history-approximation
Succinct summary of old stream tuples
Like indexes/materialized-views, but base data is
unavailable
Examples
Sliding Windows
Samples
Histograms
Wavelet representation
Sketching techniques
 Approximate Algorithms: e.g., median, quantiles,…
 Fast and light Data Mining algorithms
2
Overview of Stream Synopses
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line
quantile computation
Wavelets: Haar-wavelet histogram
construction & maintenance
3
Sampling: Basics
• Idea: A small random sample S of the data often represents all the data well
– For a fast approx answer, apply “modified” query to S
– Example: select agg from R where odd(R.e)
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1 (n=12)
Sample S: 9 5 1 8
– If agg is avg, return the average of the odd elements in S (answer: 5)
– If agg is count, return n times the average over all elements e in S of
• 1 if e is odd
• 0 if e is even
(answer: 12*3/4 = 9)
Unbiased: For expressions involving count, sum, avg: the estimator
is unbiased, i.e., the expected value of the answer is the actual answer
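The example above can be reproduced with a few lines of Python. This is only a sketch: the function name `estimate_from_sample` and its parameters are illustrative and not part of the original notes, and the sample is assumed to be a uniform random sample of the stream.

```python
def estimate_from_sample(sample, n, predicate):
    """Estimate avg and count over the stream elements satisfying predicate.

    sample: (assumed) uniform random sample of the stream
    n:      number of elements seen in the stream
    """
    matching = [e for e in sample if predicate(e)]
    # avg: average of the matching sample elements (None if nothing matches)
    avg = sum(matching) / len(matching) if matching else None
    # count: scale the matching fraction of the sample up to the stream size
    count = n * len(matching) / len(sample)
    return avg, count


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = [9, 5, 1, 8]  # the sample S from the slide
avg, count = estimate_from_sample(sample, len(stream), lambda e: e % 2 == 1)
print(avg, count)  # 5.0 9.0
```

The true count of odd elements in this stream is 8, so the estimate of 9 is close but not exact, which is the situation the probabilistic guarantees on the next slide address.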
Garofalakis, Gehrke, Rastogi, VLDB’02 #
4
Probabilistic Guarantees
Example: Actual answer is within 5 ± 1
with prob ≥ 0.9
Use Tail Inequalities to give probabilistic
bounds on returned answer
Markov Inequality
Chebyshev’s Inequality
Hoeffding’s Inequality
Chernoff Bound
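As one illustration of how a tail inequality yields such a guarantee, the sketch below uses Hoeffding's inequality for the mean of m independent samples with values in [0, 1]: P(|mean - mu| >= eps) <= 2 exp(-2 m eps^2). The function names and the choice of eps and delta are assumptions made for this example only.

```python
import math


def hoeffding_failure_bound(m, eps):
    """Upper bound on P(|sample mean - true mean| >= eps) for m samples in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * eps * eps))


def hoeffding_sample_size(eps, delta):
    """Smallest m for which the Hoeffding bound is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


# Sample size so that the estimated fraction is within 0.1 of the truth
# with probability at least 0.9 (eps and delta are hypothetical choices).
m = hoeffding_sample_size(eps=0.1, delta=0.1)
print(m)  # 150
```

The required sample size depends only on eps and delta, not on the length of the stream.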
5
Sampling—some background
 Reservoir Sampling [Vit85]: Maintains a sample S having a preassigned size M on a stream of arbitrary size
Add each new element to S with probability M/n, where n is
the current number of stream elements
If an element is added, evict a random element from S
Instead of flipping a coin for each element, determine the
number of elements to skip before the next to be added to
S
 Concise sampling [GM98]: Duplicates in sample S stored as
<value, count> pairs (thus, potentially boosting actual sample
size)
 Counting Samples [GM98]: for answering hot list queries (k most
frequent values)
 Window Sampling [BDM02,BOZ08]. Maintains a sample S having
a pre-assigned size M on a window on a stream—reservoir
sampling with expiring tuples.
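A minimal sketch of the basic reservoir scheme described above, assuming the first M elements simply fill the reservoir. It flips a coin for every element and does not implement the skip-count optimization; the function and parameter names are illustrative.

```python
import random


def reservoir_sample(stream, M, rng=None):
    """Maintain a uniform sample of size M over a stream of unknown length."""
    rng = rng or random.Random()
    S = []
    for n, x in enumerate(stream, start=1):
        if n <= M:
            S.append(x)                 # the first M elements fill the reservoir
        elif rng.random() < M / n:      # add the new element with probability M/n
            S[rng.randrange(M)] = x     # and evict a randomly chosen element
    return S


sample = reservoir_sample(range(1000), M=10, rng=random.Random(0))
print(len(sample))  # 10
```

The window-sampling variants additionally have to handle tuples that expire from the window, which this sketch does not attempt.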
6
Load Shedding Using Samples
Given a complex query graph, how should the
sampling process be used and managed? [BDM04]
More about this later [LawZ08]
7
Overview
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line
quantile computation
Wavelets: Haar-wavelet histogram
construction & maintenance
Sketches
8
Histograms
 Histograms approximate the frequency distribution of element
values in a stream
 A histogram (typically) consists of
 A partitioning of element domain values into buckets
 A count per bucket B (of the number of elements in B)
 Widely used in DBMS query optimization
Many types have been proposed:
 Equi-Depth Histograms: select buckets such that counts per
bucket are equal
 V-Optimal Histograms: select buckets to minimize frequency
variance within buckets
 Wavelet-based Histograms
9
Types of Histograms
• Equi-Depth Histograms
– Idea: Select buckets such that counts per bucket are equal
[Figure: count per bucket (y-axis) over domain values 1 to 20 (x-axis)]
• V-Optimal Histograms [IP95] [JKM98]
– Idea: Select buckets to minimize frequency variance within buckets
minimize $\sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{V_B} \right)^2$
[Figure: count per bucket (y-axis) over domain values 1 to 20 (x-axis)]
10
Equi-Depth Histogram Construction
 For histogram with b buckets, compute elements
with rank n/b, 2n/b, ..., (b-1)n/b
 Example: (n=12, b=4)
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
After sort: 1 1 2 3 4 5 5 6 7 8 9 9
rank = 3 (.25-quantile)
rank = 6 (.5-quantile)
rank = 9 (.75-quantile)
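The construction can be sketched offline as follows. This version sorts the full data set, which a stream system usually cannot afford, so it only illustrates which elements become bucket boundaries; the function name is illustrative and n >= b is assumed.

```python
def equi_depth_boundaries(data, b):
    """Return the elements with rank n/b, 2n/b, ..., (b-1)n/b (ranks are 1-based)."""
    ordered = sorted(data)
    n = len(ordered)
    # rank r corresponds to index r - 1 in the sorted list
    return [ordered[(k * n) // b - 1] for k in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
boundaries = equi_depth_boundaries(stream, 4)
print(boundaries)  # [2, 5, 7]
```

The three boundaries are the elements at ranks 3, 6 and 9 of the sorted stream.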
11
Answering Queries Using Histograms [IP99]
• (Implicitly) map the histogram back to an approximate
relation, & apply the query to the approximate relation
• Example: select count(*) from R where 4 <= R.e <= 15
[Figure: histogram over domain values 1 to 20, with each bucket count spread evenly among the bucket's values; the range 4 ≤ R.e ≤ 15 is highlighted]
answer: 3.5 * C_B
• For equi-depth histograms, maximum error: ± 2 * C_B
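A sketch of the uniform-spread estimate for a range count. The bucket boundaries below are hypothetical (the figure's actual boundaries are not recoverable from these notes); they are chosen so that the range 4..15 covers three full buckets and half of a fourth.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) for lo <= R.e <= hi from a histogram.

    buckets: list of (first_value, last_value, count) with inclusive integer
    ranges; each count is assumed to be spread evenly over the bucket's values.
    """
    total = 0.0
    for first, last, count in buckets:
        overlap = min(hi, last) - max(lo, first) + 1  # values of the bucket in range
        if overlap > 0:
            total += count * overlap / (last - first + 1)
    return total


C_B = 10  # hypothetical per-bucket count of an equi-depth histogram
buckets = [(1, 3, C_B), (4, 7, C_B), (8, 11, C_B),
           (12, 14, C_B), (15, 16, C_B), (17, 20, C_B)]
estimate = estimate_range_count(buckets, 4, 15)
print(estimate)  # 35.0, i.e. 3.5 * C_B
```

Only the buckets that are partially covered by the range contribute error, which is why the error for an equi-depth histogram is bounded by the counts of the two end buckets.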
12
Approximate Algorithms
 Quantiles Using Samples
 Quantiles from Synopses
 One pass algorithms for approximate samples …
 Much work in this area … omitted
13
Overview
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line
quantile computation
Wavelets: Haar-wavelet histogram
construction & maintenance
Sketches
14
One-Dimensional Haar Wavelets
• Wavelets: Mathematical tool for hierarchical decomposition
of functions/signals
• Haar wavelets: Simplest wavelet basis, easy to understand
and implement
– Recursive pairwise averaging and differencing at different
resolutions
Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ---
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]
Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
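The recursive averaging and differencing can be sketched as below, assuming the input length is a power of two and using the unnormalized convention of the slide (average = (a + b) / 2, detail = (a - b) / 2). The function name is illustrative.

```python
def haar_decompose(values):
    """Unnormalized Haar decomposition; len(values) must be a power of two."""
    averages = [float(v) for v in values]
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details  # coarser levels go in front
    return averages + details  # [overall average, coarsest detail, ..., finest details]


coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```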
15
Haar Wavelet Coefficients
• Hierarchical decomposition structure
(a.k.a. “error tree”)
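[Figure: error tree for the decomposition [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]. The root 2.75 sits above -1.25, followed by 0.5 and 0, then 0, -1, -1, 0; the leaves are the original frequency distribution 2 2 0 2 3 5 4 4. Edges are labeled + (left) and - (right). A companion panel shows the coefficient “supports”.]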
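The error tree implies that any single value can be rebuilt from the coefficients on its root-to-leaf path, adding a coefficient when the value lies in its left half and subtracting it otherwise. A sketch, assuming the coefficient layout [average, coarsest detail, ..., finest details] and a power-of-two length; the function name is illustrative.

```python
def reconstruct_point(coeffs, i):
    """Rebuild the i-th original value from the coefficients on its path."""
    n = len(coeffs)
    value = coeffs[0]          # overall average at the root
    node, lo, hi = 1, 0, n     # detail node and the index range [lo, hi) it supports
    while node < n:
        mid = (lo + hi) // 2
        if i < mid:            # left half of the support: sign +
            value += coeffs[node]
            node, hi = 2 * node, mid
        else:                  # right half of the support: sign -
            value -= coeffs[node]
            node, lo = 2 * node + 1, mid
    return value


coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
values = [reconstruct_point(coeffs, i) for i in range(8)]
print(values)  # [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

Each reconstruction touches only log2(n) + 1 coefficients.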
16
Compressed Wavelet Representations
Key idea: Use a compact subset of Haar/linear wavelet
coefficients for approximating frequency
distribution
Steps
 Compute cumulative frequency distribution C
 Compute linear wavelet transform of C
 Greedy heuristic methods
Retain coefficients leading to large error reduction
Throw away coefficients that give small increase in error
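The greedy idea can be illustrated with a simplified sketch. It is not the method of the slide: it applies the plain Haar transform to the raw frequencies (rather than linear wavelets on the cumulative distribution C) and repeatedly throws away the coefficient whose removal increases the sum of squared errors the least. All function names are illustrative.

```python
def haar(values):
    """Unnormalized Haar transform; len(values) must be a power of two."""
    avgs, details = [float(v) for v in values], []
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        avgs = [(a + b) / 2 for a, b in pairs]
    return avgs + details


def inverse_haar(coeffs):
    """Invert haar(): expand level by level from the overall average."""
    values, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(values)]
        values = [v for a, d in zip(values, level) for v in (a + d, a - d)]
        pos += len(level)
    return values


def sse(original, coeffs):
    """Sum of squared errors of the reconstruction from coeffs."""
    return sum((x - y) ** 2 for x, y in zip(original, inverse_haar(coeffs)))


def compress(values, keep):
    """Greedily zero out coefficients until only `keep` are retained."""
    coeffs = haar(values)
    kept = set(range(len(coeffs)))
    while len(kept) > keep:
        def error_without(i):
            trial = [c if (j in kept and j != i) else 0.0
                     for j, c in enumerate(coeffs)]
            return sse(values, trial)
        # drop the coefficient that gives the smallest increase in error
        kept.remove(min(sorted(kept), key=error_without))
    return [c if j in kept else 0.0 for j, c in enumerate(coeffs)]


data = [2, 2, 0, 2, 3, 5, 4, 4]
c5 = compress(data, 5)   # three coefficients are already zero, so no error
c2 = compress(data, 2)   # keeps only the two top coefficients
print(inverse_haar(c5), sse(data, c2))
```

On this input, five coefficients reproduce the data exactly, while keeping two leaves a squared error of 5.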
17
Overview
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line
quantile computation
Wavelets: Haar-wavelet histogram
construction & maintenance
Sketches
18
Sketches
 Conventional data summaries fall short:
 Quantiles and 1-d histograms: Cannot capture attribute
correlations
 Samples (e.g., using Reservoir Sampling) perform poorly for joins
 Multi-d histograms/wavelets: Construction requires multiple passes
over the data
 Different approach: Randomized sketch synopses
 Only logarithmic space
 Probabilistic guarantees on the quality of the approximate answer
 Can handle extreme cases.
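One well-known instance of a randomized sketch is the AMS-style "tug-of-war" sketch, shown here estimating a join size. This is a sketch under assumptions: the slide does not name a specific sketch, a real implementation would draw the +1/-1 signs from a 4-wise independent hash family (a cryptographic hash stands in for it here), and the function names and group sizes are illustrative.

```python
import hashlib
import statistics


def sign(seed, value):
    """Pseudo-random +1/-1 for (seed, value); stand-in for a 4-wise independent family."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return 1 if digest[0] & 1 else -1


def sketch(stream, seeds):
    """One counter per seed: X = sum over the stream of sign(seed, value)."""
    return [sum(sign(s, v) for v in stream) for s in seeds]


def estimate_join_size(sketch_r, sketch_s, groups=5):
    """Median over groups of the mean of the products X_R * X_S."""
    products = [x * y for x, y in zip(sketch_r, sketch_s)]
    size = len(products) // groups
    means = [statistics.mean(products[g * size:(g + 1) * size])
             for g in range(groups)]
    return statistics.median(means)


seeds = range(50)
R = [7, 7, 7]            # join attribute values of R
S = [7, 7, 7, 7, 7]      # join attribute values of S
print(estimate_join_size(sketch(R, seeds), sketch(S, seeds)))  # 15
```

Each counter is updated by one addition per tuple, and the sketch of a concatenated stream is the element-wise sum of the sketches, so the synopsis can be maintained in a single pass.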
19
Overview
 Windows: logical, physical (covered)
 Samples: Answering queries using samples
 Histograms: Equi-depth histograms, On-line quantile
computation
 Wavelets: Haar-wavelet histogram construction &
maintenance
Sketches
QoS by load shedding.
20
QoS and Load Shedding
When the input stream rate exceeds system capacity,
a stream manager can shed load (tuples)
 Load shedding affects queries and their answers:
drop the tasks and the tuples that will cause least
loss
 Introducing load shedding in a data stream
manager is a challenging problem
 Random load shedding or semantic load shedding
21
Load Shedding in Aurora
 QoS for each application as a function
relating output to its utility
– Delay based, drop based, value based
Techniques for introducing load shedding
operators in a plan such that QoS is disrupted
the least
– Determining when, where and how much
load to shed
22
Load Shedding in STREAM
Formulate load shedding as an optimization
problem for multiple sliding window aggregate
queries
– Minimize inaccuracy in answers subject to
output rate matching or exceeding arrival
rate
Consider placement of load shedding
operators in query plan
– Each operator sheds load uniformly with
probability p_i
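A sketch of a single random load-shedding operator feeding a sum aggregate. It assumes that p is the probability of dropping a tuple and that the aggregate is scaled by 1 / (1 - p) to compensate; both the reading of p and the names used are assumptions, not details taken from the slide.

```python
import random


def shed_and_sum(stream, drop_prob, rng=None):
    """Drop each tuple with probability drop_prob (< 1) and return
    (scaled sum estimate, number of tuples actually processed)."""
    rng = rng or random.Random()
    kept_sum, kept = 0.0, 0
    for x in stream:
        if rng.random() >= drop_prob:   # tuple survives the load shedder
            kept_sum += x
            kept += 1
    # scale up to compensate for the dropped fraction
    return kept_sum / (1.0 - drop_prob), kept


window = list(range(1, 101))
estimate, processed = shed_and_sum(window, drop_prob=0.5, rng=random.Random(1))
print(estimate, processed)
```

With `drop_prob=0.0` nothing is shed and the exact sum is returned; larger values trade accuracy for a lower processing load.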
23
References
[BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a Moving Window over Streaming
Data”. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
pp. 633–634, 2002.
[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”.
Submitted for publication.
[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.
[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving
Approximate Query Answers”. ACM SIGMOD 1998.
[BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries
over Data Streams”. ICDE 2004, pp. 350–361.
[LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and
Mining Queries on Data Streams under Load Shedding”. International Journal of Business
Intelligence and Data Mining, 2008.
24