Transcript Document
Deriving Predicate Statistics (SDP) in Datalog
Principles and Practice of Declarative Programming
12th International ACM SIGPLAN Symposium
July 26, 2010, Hagenberg, Austria
Senlin Liang and Michael Kifer
Stony Brook University
Summary of Our Approach
Motivation
Take advantage of cost-based optimizations in
deductive database systems
Compute cost information (predicate statistics)
Store and retrieve cost information efficiently
Apply optimization techniques
Advantages of our approach
Keeps argument dependencies
Handles recursion
Handles negation
PPDP July 26, 2010
“Deriving Predicate Statistics in Datalog” by Senlin Liang and Michael Kifer
Outline
Introduction
SDP
Traditional approach: histograms + argument
independence assumption
Error grows exponentially
Dependency matrix stores predicate statistics
Abstract interpretation of Datalog rules, which are
evaluated over dependency matrices
Experimental studies
Future work
Histograms
Data distribution: T = ((v1, f1), …, (vn, fn))
E.g. T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Histograms
Partition data distribution into groups
Summarize each group as a bucket:
(floor, ceiling, size, count)
Compute the values and frequencies in each
bucket efficiently
MaxDiff histograms with β buckets
Partition T using β-1 largest frequency differences
Example: MaxDiff Histograms (3 buckets)
1. Partition T using 2 largest frequency differences
2. Summarize as (floor, ceiling, size, count)
3. Value-frequency approximation
vals(bucket) = [floor, ceiling];
f(val) = count/size, e.g. f(7)=5/3
T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Adjacent frequency differences: 1, 1, 2, 2, 1, 0
Buckets: (2,4,3,4) (5,5,1,3) (6,8,3,5)
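The three steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and ties between equal frequency differences are broken by position, which is an assumption the slides do not state.

```python
def maxdiff_histogram(dist, beta):
    """Split a value-sorted list of (value, frequency) pairs into beta buckets.

    Each bucket is summarized as (floor, ceiling, size, count), where size is
    the number of values in the bucket and count is the sum of frequencies.
    """
    # Frequency difference between each pair of adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Cut after the positions with the beta-1 largest differences.
    # Tie-breaking by position is an assumption of this sketch.
    ranked = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(ranked[:beta - 1])

    buckets, start = [], 0
    for cut in cuts + [len(dist) - 1]:
        group = dist[start:cut + 1]
        buckets.append((group[0][0], group[-1][0], len(group),
                        sum(freq for _, freq in group)))
        start = cut + 1
    return buckets


def approx_frequency(bucket):
    """Approximate frequency of any value in a bucket: count / size."""
    _, _, size, count = bucket
    return count / size


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_histogram(T, 3)
print(buckets)                       # the three buckets from the slide
print(approx_frequency(buckets[2]))  # f(7) = 5/3
```

On the slide's data the two largest differences (both 2) fall on either side of value 5, so 5 ends up in a bucket of its own.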
Argument Independence Assumption
Common in database size estimates
Data distributions of different arguments are
independent of each other
For example, in predicate p(X,Y), the data
distributions of X and Y are independent
Joint data distribution can be easily computed
from individual distributions
E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
Unfortunately, the independence assumption
is almost always wrong in real datasets
Example: Histogram+Independence = Poor Estimate
answer(X,Y) :- e(X,Y), 5 ≤X≤7.
Facts: e(2,2), … as in Example 1 of the paper.
Histogram buckets of e
X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
Size estimate
Answer size estimate for each bucket:
|[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
Summed over the X buckets: size(answer) = 0 + 3 + 3.33 = 6.33
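The per-bucket estimate can be checked with a short Python sketch. It assumes integer-valued arguments, so that |[floor, ceiling]| is the number of integers in the range; the function name is hypothetical.

```python
def selection_size(buckets, lo, hi):
    """Estimate how many facts satisfy lo <= X <= hi from X's histogram."""
    total = 0.0
    for floor, ceiling, _size, count in buckets:
        # Number of integers shared by [floor, ceiling] and [lo, hi].
        overlap = max(0, min(ceiling, hi) - max(floor, lo) + 1)
        width = ceiling - floor + 1
        total += overlap / width * count
    return total


x_buckets = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
estimate = selection_size(x_buckets, 5, 7)
print(round(estimate, 2))  # 0 + 3 + 3.33 = 6.33
```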
Example: Histogram+Independence = Poor Estimate
Histogram buckets of e
X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
Histogram buckets of answer
X: (5,5,1,3) (6,7,2,3.33)
Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
answer.count = e.count × size(answer) / size(e)
Real results for answer.Y
(1,1,1,0) (2,4,3,0) (5,8,4,6)
Independence causes information loss
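The information loss can be reproduced in a few lines of Python. Under independence, the selection on X scales every Y bucket by the same factor size(answer)/size(e). This is a sketch with a hypothetical helper name; the bucket values and the real counts are the ones given on the slide.

```python
def scale_counts(buckets, factor):
    """Scale every bucket count by one factor; floor, ceiling, size are kept."""
    return [(floor, ceiling, size, count * factor)
            for floor, ceiling, size, count in buckets]


e_y = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]
size_e = sum(count for _, _, _, count in e_y)   # 12 facts in e
size_answer = 3 + 5 * 2 / 3                     # 6.33, estimated from the X buckets

answer_y = scale_counts(e_y, size_answer / size_e)
estimated = [round(count, 2) for _, _, _, count in answer_y]
real = [0, 0, 6]                                # real counts from the slide

print(estimated)  # estimated counts for answer.Y
print(real)
```

The estimate spreads the answer over all three Y buckets, while the real answer lies entirely in the bucket (5,8,4).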
Our Approach: Dependency Matrices
Only dependency matrices (DM) for binary predicates are considered
Partition facts into local groups
Sum up the groups into DM values
Summarize each row/column as (floor, ceiling, size)
Example: DM
Fact Matrix
F(i,j) = 1 iff p(i,j) is a fact
Partition fact matrix using MaxDiff
Sum up partitions into matrix values
[Figure: fact matrix F (rows 2 to 8, columns 1 to 8), shown before and after MaxDiff partitioning]
Example: DM
Fact Matrix
F(i,j) = 1 iff p(i,j) is a fact
[Figure: fact matrix F (rows 2 to 8, columns 1 to 8) with its MaxDiff partitions]
Partition fact matrix using MaxDiff
Sum up partitions into matrix values
Sum up each row/column into (floor,ceiling,size)
M, entries per row:
(2,4,3): 2 2
(5,5,1): 3
(6,8,3): 1 1 3
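The construction above can be sketched in Python. Two things here are assumptions and not taken from the slides: the fact set is hypothetical (the real facts are in Example 1 of the paper) and was chosen only so that the row sums match the buckets of X, and the row/column groups are passed in explicitly instead of being derived by MaxDiff.

```python
def dependency_matrix(facts, row_groups, col_groups):
    """Sum a binary predicate's facts into a matrix of partition counts.

    row_groups and col_groups are inclusive (floor, ceiling) ranges.
    Returns the matrix plus (floor, ceiling, size) summaries for rows and
    columns, where size is the number of distinct values seen in the range.
    """
    def index(groups, value):
        return next(i for i, (lo, hi) in enumerate(groups) if lo <= value <= hi)

    matrix = [[0] * len(col_groups) for _ in row_groups]
    for x, y in facts:
        matrix[index(row_groups, x)][index(col_groups, y)] += 1

    def summarize(groups, values):
        return [(lo, hi, len({v for v in values if lo <= v <= hi}))
                for lo, hi in groups]

    rows = summarize(row_groups, [x for x, _ in facts])
    cols = summarize(col_groups, [y for _, y in facts])
    return matrix, rows, cols


# Hypothetical facts p(X, Y), for illustration only.
facts = [(2, 2), (3, 3), (3, 5), (4, 6), (5, 5), (5, 6), (5, 7),
         (6, 8), (7, 7), (7, 8), (8, 1), (8, 4)]
matrix, rows, cols = dependency_matrix(
    facts, row_groups=[(2, 4), (5, 5), (6, 8)], col_groups=[(1, 1), (2, 4), (5, 8)])

for summary, values in zip(rows, matrix):
    print(summary, values)
```

Unlike two separate histograms, each cell records how many facts fall in a given X range and a given Y range at the same time, which is what preserves the dependency between the arguments.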
SDP for Selection by Example
answer(X,Y) :- e(X,Y), 5 ≤X≤7.
From fact matrix, we know that
size(answer) = ΣF(i,j) for 5 ≤ i ≤ 7 = 6
[Figure: fact matrix F (rows 2 to 8, columns 1 to 8) with twelve 1-entries]
SDP for Selection by Example
answer(X,Y) :- e(X,Y), 5 ≤X≤7.
Extract the portions
covered by the selection
Recompute matrix values
Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
For each row, recompute
(floor, ceiling, size)
M, entries per row:
(2,4,3): 2 2
(5,5,1): 3
(6,8,3): 1 1 3
d, entries per row:
(5,5,1): 3
(6,7,2): .67 .67 2
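The selection steps can be sketched in Python on the row entries of M. The sketch assumes integer values spread uniformly within a row's range, so a row that is partly covered keeps the covered fraction of each entry; the function name is hypothetical.

```python
def select_rows(rows, lo, hi):
    """Restrict a dependency matrix to the rows covered by lo <= X <= hi.

    rows is a list of ((floor, ceiling, size), [entries]).
    """
    selected = []
    for (floor, ceiling, size), entries in rows:
        new_floor, new_ceiling = max(floor, lo), min(ceiling, hi)
        if new_floor > new_ceiling:
            continue  # the row lies entirely outside the selection
        # Fraction of the row's range that the selection covers.
        fraction = (new_ceiling - new_floor + 1) / (ceiling - floor + 1)
        selected.append(((new_floor, new_ceiling, size * fraction),
                         [value * fraction for value in entries]))
    return selected


M = [((2, 4, 3), [2, 2]), ((5, 5, 1), [3]), ((6, 8, 3), [1, 1, 3])]
d = select_rows(M, 5, 7)
size_answer = sum(sum(entries) for _, entries in d)

for summary, entries in d:
    print(summary, [round(v, 2) for v in entries])
print(round(size_answer, 2))
```

Without intermediate rounding the total is 6.33; the slide's 6.34 comes from adding the entries after rounding them to .67.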
Example: Sort-Merge-Join
answer(X,Z) :- a(X,Y), b(Y,Z).
middle(X,Y,Z) is introduced only for ease of explanation
a: ......, a(4,3), a(4,4), ……
b: ……, b(3,1), b(3,5), b(4,5), ……
middle: (4,3,1), (4,3,5), (4,4,5), ……
answer: (4,1), (4,5), (4,5), ……
Duplicates!
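The example can be replayed with a small sort-merge join in Python. Only the facts visible on the slide are used; the elided facts are left out, and the function name is hypothetical.

```python
def sort_merge_join(a, b):
    """Join a(X, Y) with b(Y, Z) on Y and return middle(X, Y, Z) tuples."""
    a = sorted(a, key=lambda fact: fact[1])  # sort a on Y
    b = sorted(b, key=lambda fact: fact[0])  # sort b on Y
    middle, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        key_a, key_b = a[i][1], b[j][0]
        if key_a < key_b:
            i += 1
        elif key_a > key_b:
            j += 1
        else:
            # Find the run of facts sharing this Y value on each side.
            i_end = i
            while i_end < len(a) and a[i_end][1] == key_a:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][0] == key_a:
                j_end += 1
            for x, y in a[i:i_end]:
                for _, z in b[j:j_end]:
                    middle.append((x, y, z))
            i, j = i_end, j_end
    return middle


a_facts = [(4, 3), (4, 4)]
b_facts = [(3, 1), (3, 5), (4, 5)]
middle = sort_merge_join(a_facts, b_facts)
answer = [(x, z) for x, _, z in middle]   # project away Y

print(middle)
print(answer)            # (4, 5) appears twice
print(len(set(answer)))  # distinct answers
```

Projecting away Y is what creates the duplicate: (4,3,5) and (4,4,5) are distinct in middle but both map to answer(4,5).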
SDP for Join by Example
answer(X,Z) :- a(X,Y), b(Y,Z).
Simulate Sort-Merge-Join
A          (1,1,1)  (2,4,2)
(2,4,3)    2        4
(5,5,1)

B          (6,8,2)  (9,9,1)
(1,1,1)    1
(2,4,2)    3

align
A.X, A.Y, A.Val              B.Y, B.Z, B.Val
(2,4,3), (1,1,1), 2          (1,1,1), (6,8,2), 1
(2,4,3), (2,4,2), 4          (2,4,2), (6,8,2), 3
SDP for Join by Example
answer(X,Z) :- a(X,Y), b(Y,Z).
A.X, A.Y, A.Val              B.Y, B.Z, B.Val
(2,4,3), (1,1,1), 2          (1,1,1), (6,8,2), 1
(2,4,3), (2,4,2), 4          (2,4,2), (6,8,2), 3
Result size of middle(X,Y,Z) can be estimated as
min(A.Y.size,B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)
Examples:
size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1)
size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2)
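The estimate can be evaluated directly in Python. This sketch only transcribes the formula above; the function name and the tuple layout of the entries are assumptions made for illustration.

```python
def estimate_middle(a_entry, b_entry):
    """Estimate size(middle) for one aligned pair of DM entries.

    a_entry is (A.X, A.Y, A.Val) and b_entry is (B.Y, B.Z, B.Val), where
    each summary is a (floor, ceiling, size) triple.
    """
    _, a_y, a_val = a_entry
    b_y, _, b_val = b_entry
    a_y_size, b_y_size = a_y[2], b_y[2]
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


first = estimate_middle(((2, 4, 3), (1, 1, 1), 2), ((1, 1, 1), (6, 8, 2), 1))
second = estimate_middle(((2, 4, 3), (2, 4, 2), 4), ((2, 4, 2), (6, 8, 2), 3))
print(first)   # min(1,1) × (2/1) × (1/1) = 2
print(second)  # min(2,2) × (4/2) × (3/2) = 6
```

Read this way, A.Val/A.Y.size is the average number of A facts per Y value, B.Val/B.Y.size is the same for B, and min(...) bounds the number of Y values the two sides can share.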
SDP for Join by Example
answer(X,Z) :- a(X,Y), b(Y,Z).
Examples:
middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
Duplicates!
Three duplicate handling approaches
Sum: no duplicate removal
Max: most aggressive removal
Expected sum: remove “expected” number of
duplicates
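A short Python sketch contrasts the first two approaches on the two middle estimates (2 and 6) that land in the same answer cell. It assumes that Sum adds the contributions and Max keeps only the largest one; Expected sum is omitted because its formula is given in the paper, not on the slides.

```python
# Estimated sizes of the two middle entries that both project onto
# answer((2,4,3),(6,8,2)).
contributions = [2.0, 6.0]

sum_estimate = sum(contributions)  # Sum: treat every tuple as distinct
max_estimate = max(contributions)  # Max: treat smaller groups as duplicates

print(sum_estimate)
print(max_estimate)
```

Under these assumptions Sum can only overestimate and Max can only underestimate, which is why an estimate between the two is attractive.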
SDP for Recursive Predicates
Recursive predicates are computed
incrementally until they reach approximate
fixed points
Size reaches an α-approximate fixed point if
Δ(size)/size ≤ α
where
Δ(…) is the difference between two consecutive iterations in fixed point computation
0 ≤ α ≤ 1
Example: Recursive Predicates
Transitive closure
path(X,Y) :- edge(X,Y).             (base)
path(X,Y) :- edge(X,Z), path(Z,Y).  (rec)
Computation of the estimate:
1. Compute size(path) and DM(path) using rule base
2. Compute size(path) and DM(path) using rule rec as
in the case of a join
3. If size(path) reaches an approximate fixed point,
stop; otherwise, go to step 2
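The loop in steps 1 to 3 can be sketched in Python. To keep the example self-contained it iterates over exact fact sets for a small hypothetical edge relation instead of dependency matrices, so only the stopping rule is illustrated. Dividing Δ(size) by the size of the newer iteration is an assumption; the slides do not say which iteration's size is used.

```python
def approx_fixed_point(base, step, alpha):
    """Iterate step() from base until Δ(size)/size <= alpha."""
    current = base
    while True:
        nxt = step(current)
        delta = abs(len(nxt) - len(current))
        # Assumption: the denominator is the size after the newer iteration.
        if len(nxt) == 0 or delta / len(nxt) <= alpha:
            return nxt
        current = nxt


# Hypothetical edge facts: a chain 1 -> 2 -> 3 -> 4.
edge = {(1, 2), (2, 3), (3, 4)}


def apply_rules(path):
    """One round of rule base plus rule rec."""
    joined = {(x, y) for (x, z) in edge for (z2, y) in path if z == z2}
    return edge | joined


exact = approx_fixed_point(set(edge), apply_rules, alpha=0.0)
early = approx_fixed_point(set(edge), apply_rules, alpha=0.5)
print(len(exact))  # α = 0 runs to the true fixed point
print(len(early))  # a larger α stops after the first round
```

With α = 0 the loop is an ordinary fixed-point computation; a larger α trades accuracy for fewer iterations.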
Experimental Studies
Test programs:
Transitive closure
General same generation
Datasets: generated with Thomas Process
and Matern Cluster Process
Results
SDP estimates converge to real sizes for
recursive predicates
Expected sum is good for duplicate removal
Details in the paper
Experimental Studies
SDP estimates converge to real sizes for
recursive predicates
Transitive Closure
Experimental Studies
Expected sum is good for duplicate removal
Transitive Closure
Conclusion
Dependency matrix for binary predicates
Overcomes problems with argument
independence assumption
SDP for selection, join, and recursion
Experimental validations
Future work
More complex recursions
Negation
Extending SDP to n-ary predicates
Apply cost-based optimization in deductive
systems, such as XSB