Transcript Slide 1

A ROBUST, OPTIMIZATION-BASED APPROACH FOR
APPROXIMATE ANSWERING OF
AGGREGATE QUERIES
Written By
Surajit Chaudhuri, Gautam Das, Vivek Narasayya
(Microsoft Research, Washington)
Presented By
Melissa J Fernandes
Goal of the papers we have studied so far: to
approximately answer aggregation queries
Accurately
Efficiently
Main Idea of This Paper : Tailor the choice of
samples to be robust for workloads similar, but not
necessarily identical, to the given workload.
Three ways of analyzing large databases:
1) OLAP :- Online analytical processing is an approach to quickly
answer multi-dimensional analytical queries using OLAP cubes.
OLAP cube :- a data structure that allows fast analysis of
data.
Example
Data of Product
1) By city
2) By type
3) By product type
Drawbacks:
1) Expensive
2) Resource intensive
-wiki
2) Pre-computed samples
-This method gives approximate answers very efficiently.
-But it may have large errors, because capturing data with
large variance in a small pre-computed sample is almost impossible.
3) Using Workloads
What is a Workload ?
A set of Transact-SQL statements that execute against a database
or databases that have to be tuned. - msdn
Tuned :- optimize the performance of the DB
How is the sample set generated from a Workload?
- In practice, queries fall into particular patterns. Eg :
info about the state of Texas.
-The queries are run on the entire database (R).
-A column is added to the records. This column records whether
the record was selected by the queries. (tagging)
-If a record was selected by many queries, its probability of
being in the sample is higher.
-This way the Workload is used to generate the sample set.
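The tagging idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's algorithm: the function name `tag_and_sample` and the predicate representation of queries are assumptions, and sampling is done with replacement for simplicity.

```python
import random

# Hypothetical sketch of workload-driven sampling: tag each record
# with how many workload queries select it, then sample records with
# probability proportional to that tag count.
def tag_and_sample(records, queries, k, seed=0):
    # Tagging step: count, for every record, how many queries select it.
    counts = [sum(1 for q in queries if q(rec)) for rec in records]
    total = sum(counts)
    rng = random.Random(seed)
    # Weighted sampling (with replacement, for simplicity):
    # records hit by many queries are more likely to appear.
    return rng.choices(records, weights=counts, k=k) if total else []

# Usage: records are dicts, queries are predicates.
records = [{"state": "TX", "sales": 10}, {"state": "WA", "sales": 5},
           {"state": "TX", "sales": 7}]
queries = [lambda r: r["state"] == "TX", lambda r: r["sales"] > 6]
sample = tag_and_sample(records, queries, k=2)
```

Records never selected by any query get weight 0 and cannot enter the sample.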
ICICLES : A new class of
samples that tune themselves
to a dynamic Workload.
(Figure: the icicle sample of R grows with the records selected by
queries Q1, Q2, Q3.)
Outliers : Identify the tuples with
outlier values and store them in
a separate relation.
Outlier table (T1); table with no outlier values (T2).
1) Run the query on T1.
2) Run it on a uniform random sample (URSAMP) of T2.
3) Estimate the true result over T2.
4) Combine the results using the methods in the paper.
Drawbacks of Previous Studies :
1) While they have intuitive appeal, they lack rigorous problem
formulation.
2) They do not deal with uncertainty in workloads.
3) They ignore the variance in the data distribution of the
aggregate column.
Architecture for Approximate Query Processing
Offline component: selects samples from
(R).
Runtime component:
a) Rewrites the queries
to use the sample
to answer the query
approximately.
b) Reports the answers
with estimated
errors.
Offline component selects samples from R.
1) Each record has a ScaleFactor column (or it is stored in a different
relation).
2) The value of the aggregate column of each record is scaled
up by multiplying by the scale factor and then aggregated.
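The scaled aggregation above can be sketched as follows. This is a minimal illustration; the function names and record layout are assumptions, not the paper's interface.

```python
# Hypothetical sketch of answering an aggregate from a sample: each
# sampled record carries a ScaleFactor, and SUM/COUNT are estimated
# by scaling each record's contribution before aggregating.
def approx_sum(sample, agg_col):
    return sum(rec[agg_col] * rec["ScaleFactor"] for rec in sample)

def approx_count(sample):
    return sum(rec["ScaleFactor"] for rec in sample)

# Usage: two sampled records standing in for 100 and 50 records each.
sample = [{"sales": 10, "ScaleFactor": 100},
          {"sales": 20, "ScaleFactor": 50}]
total = approx_sum(sample, "sales")   # 10*100 + 20*50 = 2000
count = approx_count(sample)          # 100 + 50 = 150
```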
Error Metrics :
y = correct answer
y’ = approximate answer
Relative error : E(Q) = |y - y’| / |y|
Squared relative error : SE(Q) = (y - y’)^2 / y^2
For a GROUP BY query with g groups,
yi = correct answer for the ith group:
SE(Q) = (1/g) Σi (yi – yi’)^2 / yi^2
So a GROUP BY query with g groups is treated as
g select queries
with weight 1/g each.
pw = probability distribution over queries
pw(Q) = probability that query Q is given
Mean squared error :
MSE(pw) = ΣQ pw(Q) * SE(Q)
Root mean squared error :
RMSE(pw) = sqrt( MSE(pw) )
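The error metrics above translate directly into code. A minimal sketch (function names are my own, not from the paper):

```python
import math

# Sketch of the slide's error metrics: squared relative error per
# query, and workload-level MSE/RMSE under a query distribution pw.
def squared_rel_error(y, y_approx):
    return (y - y_approx) ** 2 / y ** 2

def mse(pw, errors):
    # pw and errors are parallel lists: probability and SE(Q) per query.
    return sum(p * e for p, e in zip(pw, errors))

def rmse(pw, errors):
    return math.sqrt(mse(pw, errors))

# Usage: two queries with equal probability.
errors = [squared_rel_error(100, 90), squared_rel_error(50, 55)]
workload_mse = mse([0.5, 0.5], errors)      # 0.5*0.01 + 0.5*0.01 = 0.01
workload_rmse = rmse([0.5, 0.5], errors)    # sqrt(0.01) = 0.1
```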
Fundamental Regions
Q1 = select values
between 10 and 50
Q2 = select values
between 40 and 70
R = Relation
W = Workload
n = no. of records in R
R is divided into the minimum no. of regions R1, R2, …, Rr such that
each query in W selects either all records in Rj or none.
Upper bound on the no. of regions: min(2^|W|, n)
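Fundamental regions can be computed by grouping records on their "selection signature": the subset of workload queries that select them. A sketch (queries modeled as predicates, an assumption for illustration):

```python
# Sketch of identifying fundamental regions: group records by the
# exact subset of workload queries that select them. Records with
# the same selection signature fall in the same region.
def fundamental_regions(records, queries):
    regions = {}
    for rec in records:
        # Signature: which queries select this record.
        sig = tuple(q(rec) for q in queries)
        regions.setdefault(sig, []).append(rec)
    return list(regions.values())

# Usage: the slide's example, Q1 selects values in [10,50],
# Q2 selects values in [40,70].
values = [15, 45, 60, 80]
queries = [lambda v: 10 <= v <= 50, lambda v: 40 <= v <= 70]
regions = fundamental_regions(values, queries)
# 15 -> (Q1 only), 45 -> (Q1 and Q2), 60 -> (Q2 only),
# 80 -> (neither): four fundamental regions.
```

Since each record maps to one of at most 2^|W| signatures, this also shows why the number of regions is bounded by min(2^|W|, n).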
Fixed Workload
Problem Statement : (FIXEDSAMP)
Input : R, W, k
Output : A sample of k records such that MSE(W) is minimized.
Solution : (3 steps)
1) Identify fundamental regions.
2) Pick exactly one record from each important fundamental region.
3) Assign appropriate values to additional columns in the sample records.
Step 1 : Number of fundamental regions (r) is induced by the
Workload W.
Case 1 : (r <=k)
Case 2 : (r >k)
____________________________
Case 1: (r <= k)
Step 2 : Pick one sample record from each fundamental region
(10, 40, 60, 80)
Step 3 : The columns RegionCount and AggSum are updated:
RegionCount = {3, 2, 2, 2}
AggSum = {60, 90, 130, 170}
We can answer COUNT, SUM and AVG queries.
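With one record per region, a query that selects whole regions is answered exactly from the stored statistics. A sketch using the slide's RegionCount and AggSum values (the function name and list encoding are illustrative assumptions):

```python
# Sketch of answering aggregates from one sample record per region:
# a query selecting a set of whole fundamental regions is answered by
# summing the stored per-region statistics.
def answer(regions_selected, region_count, agg_sum):
    cnt = sum(region_count[j] for j in regions_selected)
    total = sum(agg_sum[j] for j in regions_selected)
    return {"COUNT": cnt, "SUM": total, "AVG": total / cnt}

# Usage: the slide's four regions.
region_count = [3, 2, 2, 2]
agg_sum = [60, 90, 130, 170]
# A query selecting the first two regions:
res = answer([0, 1], region_count, agg_sum)
# COUNT = 3+2 = 5, SUM = 60+90 = 150, AVG = 150/5 = 30.0
```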
Case 2 :- (r > k)
Step 2 :
1) Sort all regions by importance:
Importance = fj * nj^2
fj = sum of the weights of all queries in W that select
region j.
nj = no. of records in region j.
(fj) measures the weight of the queries that are affected by Rj.
(nj^2) measures the effect on the error of not including Rj.
2) Pick the top k regions.
Step 3 :
Assign the values to the additional columns (RegionCount, AggSum).
Not the same as before: k records, but 2k unknowns ({RC1, …, RCk} & {AS1, …, ASk}).
MSE(W) is a quadratic expression; differentiating it yields linear equations.
Disadvantage : If the incoming queries are not identical to W, errors are unpredictable.
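Step 2 of Case 2 can be sketched directly from the importance formula (the function name and input encoding are illustrative assumptions):

```python
# Sketch of Step 2 for Case 2: rank fundamental regions by
# importance f_j * n_j^2 and keep only the top k.
def top_k_regions(f, n, k):
    # f[j]: total weight of workload queries selecting region j.
    # n[j]: number of records in region j.
    importance = [(fj * nj ** 2, j) for j, (fj, nj) in enumerate(zip(f, n))]
    importance.sort(reverse=True)
    return [j for _, j in importance[:k]]

# Usage with hypothetical weights and region sizes.
f = [0.5, 0.2, 0.3]
n = [10, 100, 4]
chosen = top_k_regions(f, n, k=2)
# Importances: 0.5*100 = 50, 0.2*10000 = 2000, 0.3*16 = 4.8,
# so regions 1 and 0 are kept.
```

The n_j^2 factor means a large region missed by the sample hurts far more than a small one, matching the slide's explanation.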
Lifting Workload
Q and Q’ => similar when the records selected by them
significantly overlap.
R = Relation
Q = Query
RQ = records selected by the query
p{Q}(R’) => the probability of occurrence of any query
that selects exactly the set of records R’.
δ (½ ≤ δ ≤ 1) and γ (0 ≤ γ ≤ ½)
define the degree to which W
influences Q’s distribution.
For any record inside RQ :
δ = probability that an incoming query will select this record.
For any record outside RQ :
γ = probability that an incoming query will select this record.
n1, n2, n3, and n4 are the counts of records in the regions:
n2 = inside RQ and selected by R’ (governed by δ)
n1 = inside RQ but not selected by R’
n3 = outside RQ but selected by R’ (governed by γ)
n4 = outside RQ and not selected by R’
If n2 or n4 is large (large overlap), p{Q}(R’) is high.
If n1 or n3 is large (small overlap), p{Q}(R’) is low.
δ => 1, γ => 0 : R’ identical to RQ
δ => 1, γ => ½ : R’ a superset of RQ
δ => ½, γ => 0 : R’ a subset of RQ
δ => ½, γ => ½ : R’ is unrestricted
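A plausible reconstruction of the lifted distribution, consistent with the n1–n4 behavior and limiting cases above (hedged: the exact formula is from the paper, and this numeric sketch is mine):

```python
# Reconstruction of the lifted distribution: with n1..n4 as on the
# slide, the probability that an incoming query selects exactly R' is
#   p{Q}(R') = delta^n2 * (1-delta)^n1 * gamma^n3 * (1-gamma)^n4
def lifted_prob(n1, n2, n3, n4, delta, gamma):
    return (delta ** n2) * ((1 - delta) ** n1) * \
           (gamma ** n3) * ((1 - gamma) ** n4)

# Usage: with delta -> 1 and gamma -> 0, only R' = RQ (n1 = n3 = 0)
# has non-zero probability, matching the first limiting case above.
p_identical = lifted_prob(0, 5, 0, 10, delta=1.0, gamma=0.0)  # 1.0
p_other = lifted_prob(1, 4, 0, 10, delta=1.0, gamma=0.0)      # 0.0
```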
STRATIFIED SAMPLING
PROBLEM : SAMP
Input : R, pw, k (pw = probability distribution function specified by W)
Output : a sample of k records such that MSE(pw) is minimized.
Solution STRAT for single-table selection queries with
aggregation.
3 Steps :
1. Stratification
(a) How many strata to partition R into?
(b) How many records in each stratum?
2. Allocation
Determine the number of sample records allotted to
each stratum.
3. Sampling
COUNT Aggregation
1) Stratification :
No. of strata = no. of fundamental regions (Lemma 1)
2) Allocation :
We want to minimize the error over the queries in pw.
k1, …, kr are unknown variables such that Σj kj = k.
From the equation on an earlier slide,
pW(R') = Σi=1..q wi * p{Qi}(R')
MSE(pW) can be expressed as a weighted sum of the MSE of each
query in the workload:
Lemma 2 : MSE(pW) = Σi wi MSE(p{Qi})
Lemma 3 : For a COUNT query Q in W,
ApproxMSE(p{Q}) approximates MSE(p{Q}).
(Equation omitted on the slide; its terms are annotated as the
expected relative squared error in estimating the count of RQ, the
expected squared error in estimating the count of RQ ∪ Rj, and the
expected squared error in estimating the count of RQ ∪ (R \ RQ).)
Since we have an (approximate) formula for MSE(p{Q}), we can
express MSE(pw) as a function of the variables kj.
Corollary 1 : MSE(pw) = Σj(αj / kj), where each αj is a function of
n1,…,nr, δ, and γ.
αj captures the “importance” of a region; it is positively correlated
with nj as well as the frequency of queries in the workload that
access Rj.
Now we can minimize MSE(pw).
Lemma 4: Σj (αj / kj) is minimized subject to Σj kj = k
if kj = k * ( sqrt(αj) / Σi sqrt(αi) )
This provides a closed-form and computationally inexpensive
solution to the allocation problem since αj depends only on
δ, γ and the number of tuples in each fundamental region.
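The closed-form allocation of Lemma 4 can be sketched in a few lines (the function name and hypothetical α values are my own):

```python
import math

# Sketch of Lemma 4's closed-form allocation: minimize sum(alpha_j / k_j)
# subject to sum(k_j) = k by making k_j proportional to sqrt(alpha_j).
def allocate(alphas, k):
    roots = [math.sqrt(a) for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

# Usage with hypothetical region importances alpha = [4, 1, 1].
ks = allocate([4.0, 1.0, 1.0], k=8)
# sqrt weights 2:1:1 -> allocation [4.0, 2.0, 2.0], summing to k = 8.
```

Note that more "important" regions (larger αj) receive more of the sample budget, but only in proportion to the square root of their importance.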
Solution for SUM Aggregate
1) Stratification
-Cannot use the same stratification as for COUNT.
-Use a bucketing technique:
fundamental regions with large variance are divided into
finer regions with significantly lower internal variance.
Each finer region is treated as a stratum.
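The bucketing step can be sketched as follows. The equal-width bucketing shown here is an assumption for illustration; the paper's actual bucketing criterion may differ.

```python
# Sketch of the bucketing technique (assumed equal-width buckets on
# the aggregate column): a high-variance fundamental region is split
# into up to h finer regions, each with much lower internal variance.
def bucketize(values, h):
    lo, hi = min(values), max(values)
    width = (hi - lo) / h or 1  # guard against all-equal values
    buckets = [[] for _ in range(h)]
    for v in values:
        idx = min(int((v - lo) / width), h - 1)
        buckets[idx].append(v)
    return [b for b in buckets if b]  # drop empty buckets

# Usage: a region mixing small and very large values.
region = [1, 2, 3, 1000, 1001]
strata = bucketize(region, h=2)
# -> [[1, 2, 3], [1000, 1001]]: each stratum has low variance.
```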
2) Allocation
-As for COUNT, we express an optimization problem, now with h*r
unknowns k1, …, kh*r.
-Unlike COUNT, the specific values of the aggregate column in each
region (as well as the variance of values in each region) influence
MSE(p{Q}).
-Let yj (Yj) be the average (sum) of the aggregate column values of all
records in region Rj.
-Since the variance within a region is small, approximate each value in
the region by yj.
Thus MSE(p{Q}) can be expressed as a function of the kj’s for
a SUM query Q in W (equation omitted on the slide).
As with COUNT, MSE(pW) for SUM has the form Σj(αj / kj), and αj
depends on the same parameters n1, …, nh*r, δ, and γ (Corollary 1).
Pragmatic Issues
-Identifying Fundamental Regions
-Handling Large Number of Fundamental Regions
-Obtaining Integer Solutions
-Obtaining an Unbiased Estimator
Extensions for More General Workloads
GROUP BY Queries :- Q partitions R into g groups
-Lifting model :- replace Q with g separate selection queries.
-Tagging step :- append <c,v> to each selected record t
(c = column id used in the GROUP BY, v = value of the GROUP BY
column in record t.)
Join Queries :- Star queries contain
-One source relation and a set of dimension relations connected via
foreign-key joins.
-GROUP BY and selections on the source and dimension relations.
-Aggregation over columns of the source relation.
-Approach 1
- Identify samples only over the source relation (the source relation is large).
-Approach 2
- Identify samples over the source relation and precompute its join with all
dimension relations.
Experimental Results :
FIXED – solution for FIXEDSAMP, fixed workload, identical queries
STRAT – solution for SAMP, workloads with single-table selection queries with
aggregation
PREVIOUS WORK
USAMP – uniform random sampling
WSAMP – weighted sampling
OTLIDX – outlier indexing combined with weighted sampling
CONG – Congressional sampling
Conclusion:
The solutions FIXED and STRAT handle the
problems of data variance, heterogeneous
mixes of queries, GROUP BY and foreign-key
joins.