
Deriving Predicate Statistics (SDP) in Datalog
12th International ACM SIGPLAN Symposium on
Principles and Practice of Declarative Programming
July 26, 2010, Hagenberg, Austria
Senlin Liang and Michael Kifer
Stony Brook University
Summary of Our Approach

Motivation

Take advantage of cost-based optimizations in
deductive database systems




Compute cost information (predicate statistics)
Store and retrieve cost information efficiently
Apply optimization techniques
Advantages of our approach



Keeps argument dependencies
Handles recursion
Handles negation
PPDP July 26, 2010
“Deriving Predicate Statistics in Datalog” by Senlin Liang and Michael Kifer
Outline

Introduction



SDP




Traditional approach: histograms + argument
independence assumption
Error grows exponentially
Dependency matrix stores predicate statistics
Abstract interpretation of Datalog rules, which are
evaluated over dependency matrices
Experimental studies
Future work
Histograms

Data distribution: T = ((v1, f1), …, (vn, fn)).
E.g. ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Histograms
Partition data distribution into groups
Summarize each group as a bucket:
(floor, ceiling, size, count)
Compute the values and frequencies in each
bucket efficiently
MaxDiff histograms with β buckets
Partition T using β-1 largest frequency differences
Example: MaxDiff Histograms (3 buckets)
1. Partition T using 2 largest frequency differences
2. Summarize as (floor, ceiling, size, count)
3. Value-frequency approximation
vals(bucket) = [floor, ceiling];
f(val) = count/size, e.g. f(7)=5/3
T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
Frequency differences between neighbours: 1, 1, 2, 2, 1, 0
Buckets: (2,4,3,4) (5,5,1,3) (6,8,3,5)
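The three steps above can be sketched in Python. This is a minimal illustration rather than the authors' implementation: the function names `maxdiff_buckets` and `estimate_frequency` are made up here, and the tie-breaking rule (the earlier position wins when two differences are equal) is an assumption the slides do not state.

```python
def maxdiff_buckets(dist, beta):
    """Partition a value-sorted distribution [(value, freq), ...] into beta buckets."""
    # Absolute frequency difference between each pair of neighbours.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Cut after the positions holding the beta-1 largest differences.
    # Assumption: ties are broken in favour of the earlier position.
    ranked = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(ranked[: beta - 1])
    buckets, start = [], 0
    for end in cuts + [len(dist) - 1]:
        group = dist[start : end + 1]
        floor, ceiling = group[0][0], group[-1][0]
        count = sum(freq for _, freq in group)
        # size is the number of values that fall in the bucket
        buckets.append((floor, ceiling, len(group), count))
        start = end + 1
    return buckets


def estimate_frequency(buckets, value):
    """Approximate f(value) as count/size of the bucket that contains it."""
    for floor, ceiling, size, count in buckets:
        if floor <= value <= ceiling:
            return count / size
    return 0.0


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_buckets(T, 3)
print(buckets)
print(estimate_frequency(buckets, 7))
```

For the distribution T on this slide the sketch cuts at the two differences of 2, which gives the three buckets listed above and f(7) = 5/3.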
Argument Independence Assumption


Common in database size estimates
Data distributions of different arguments are
independent of each other



For example, in predicate p(X,Y), the data
distributions of X and Y are independent
Joint data distribution can be easily computed
from individual distributions
E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
Unfortunately, the independence assumption
is almost always wrong in real datasets
Example: Histogram+Independence = Poor Estimate



answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
Facts: e(2,2), … as in Example 1 of the paper.
Histogram buckets of e



X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
Size estimate


Answer size estimate for each bucket:
|[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
Summing over the X buckets: size(answer) = 0 + 3 + 3.33 = 6.33
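The size estimate can be reproduced with a few lines of Python. This is a sketch that assumes integer-valued arguments, so that the length of a range is ceiling - floor + 1; the function name `selection_size` is illustrative only.

```python
def selection_size(buckets, low, high):
    """Estimate how many facts satisfy low <= X <= high from X's histogram buckets."""
    total = 0.0
    for floor, ceiling, _size, count in buckets:
        # Number of integer values shared by [floor, ceiling] and [low, high].
        overlap = max(0, min(ceiling, high) - max(floor, low) + 1)
        # Scale the bucket count by the covered fraction of the bucket range.
        total += overlap / (ceiling - floor + 1) * count
    return total


x_buckets = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
estimate = selection_size(x_buckets, 5, 7)
print(round(estimate, 2))
```

The bucket (2,4,3,4) contributes nothing, (5,5,1,3) contributes 3, and (6,8,3,5) contributes 2/3 × 5, which adds up to about 6.33.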
Example: Histogram+Independence = Poor Estimate

Histogram buckets of e
X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
Histogram buckets of answer
X: (5,5,1,3) (6,7,2,3.33)
Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
answer.count = e.count × size(answer)/size(e)
Real results for answer.Y
(1,1,1,0) (2,4,3,0) (5,8,4,6)
Independence causes information loss
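The numbers on this slide can be checked with a short Python sketch. It applies the scaling rule answer.count = e.count × size(answer)/size(e) to the Y buckets of e and compares the result with the real counts quoted above; the variable names are illustrative.

```python
# Y buckets of e as (floor, ceiling, size, count), copied from the slide.
e_y = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]
size_e = sum(count for *_, count in e_y)   # 12 facts in e
size_answer = 3 + 5 * 2 / 3                # histogram estimate of the selection, about 6.33

# Under the independence assumption every Y bucket shrinks by the same factor.
answer_y = [
    (floor, ceiling, size, round(count * size_answer / size_e, 2))
    for floor, ceiling, size, count in e_y
]

# Real Y buckets of answer, copied from the slide.
real_y = [(1, 1, 1, 0), (2, 4, 3, 0), (5, 8, 4, 6)]
errors = [round(abs(est[3] - real[3]), 2) for est, real in zip(answer_y, real_y)]
print(answer_y)
print(errors)
```

Uniform scaling spreads the estimate over all three Y buckets, although every real answer tuple falls into the bucket (5,8,4).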
Our Approach: Dependency Matrices




Only considers dependency matrices (DM)
for binary predicates
Partitions facts into local groups
Sums up the groups into DM values
Summarizes each row/column as
(floor, ceiling, size)
Example: DM

Fact Matrix
F(i,j) = 1 iff p(i,j) is a fact


Partition fact matrix using
MaxDiff
Sum up partitions into
matrix values
[Figure: fact matrix F (rows 2..8, columns 1..8), shown as individual facts and again after MaxDiff partitioning with the partition sums]
Example: DM

Fact Matrix
F(i,j) = 1 iff p(i,j) is a fact
[Figure: fact matrix F, rows 2..8, columns 1..8]
Partition fact matrix using MaxDiff
Sum up partitions into matrix values
Sum up each row/column into (floor, ceiling, size)
Dependency matrix M (row groups with their indices; 0 marks an empty cell)
M            1    2    3
(2,4,3)  1   0    2    2
(5,5,1)  2   0    0    3
(6,8,3)  3   1    1    3
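Building a dependency matrix from facts can be sketched as follows. The fact list is hypothetical: it is not Example 1 of the paper, only a set of twelve facts chosen to agree with the histogram buckets and matrix sums shown on the slides. The sketch also assumes that the row and column groups coincide with the MaxDiff buckets of X and Y, and the name `dependency_matrix` is illustrative.

```python
def dependency_matrix(facts, x_groups, y_groups):
    """Count the facts p(x, y) that fall into each (X group, Y group) cell."""
    def group_of(groups, value):
        for k, (floor, ceiling) in enumerate(groups):
            if floor <= value <= ceiling:
                return k
        raise ValueError(f"{value} is outside every group")

    matrix = [[0] * len(y_groups) for _ in x_groups]
    for x, y in facts:
        matrix[group_of(x_groups, x)][group_of(y_groups, y)] += 1
    return matrix


# Hypothetical facts, consistent with the slide summaries but not taken from the paper.
facts = [(2, 2), (3, 3), (3, 5), (4, 6), (5, 5), (5, 6), (5, 7),
         (6, 8), (7, 5), (7, 7), (8, 1), (8, 4)]
x_groups = [(2, 4), (5, 5), (6, 8)]   # (floor, ceiling) of each row group
y_groups = [(1, 1), (2, 4), (5, 8)]   # (floor, ceiling) of each column group
M = dependency_matrix(facts, x_groups, y_groups)
for row in M:
    print(row)
```

Each cell keeps how many facts fall into a given combination of X group and Y group, which is the information that separate per-argument histograms discard.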
SDP for Selection by Example


answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
From fact matrix, we know that
size(answer) = Σ F(i,j) for 5 ≤ i ≤ 7 = 6
[Figure: fact matrix F, rows 2..8, columns 1..8, holding twelve facts]
SDP for Selection by Example


answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
Extract the portions covered by the selection
Recompute matrix values
Sum them up as
size(answer) = 3 + .67 + .67 + 2 = 6.34
For each row, recompute (floor, ceiling, size)
M            1    2    3
(2,4,3)  1   0    2    2
(5,5,1)  2   0    0    3
(6,8,3)  3   1    1    3
d            1    2    3
(5,5,1)  1   0    0    3
(6,7,2)  2   .67  .67  2
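The selection over a dependency matrix can be sketched in Python as below. The sketch assumes integer-valued arguments and scales each row by the fraction of its [floor, ceiling] range that the selection covers; the zero entries of M stand for empty cells, and the name `select_rows` as well as the rounding of the new row size are assumptions.

```python
def select_rows(row_groups, matrix, low, high):
    """Restrict a dependency matrix to the rows overlapping low <= X <= high."""
    new_groups, new_matrix = [], []
    for (floor, ceiling, size), row in zip(row_groups, matrix):
        overlap = max(0, min(ceiling, high) - max(floor, low) + 1)
        if overlap == 0:
            continue  # the row group lies entirely outside the selection
        fraction = overlap / (ceiling - floor + 1)
        # Recompute (floor, ceiling, size) for the covered part of the row group.
        new_groups.append((max(floor, low), min(ceiling, high), round(size * fraction)))
        # Recompute the matrix values for the covered part.
        new_matrix.append([value * fraction for value in row])
    return new_groups, new_matrix


rows = [(2, 4, 3), (5, 5, 1), (6, 8, 3)]
M = [[0, 2, 2], [0, 0, 3], [1, 1, 3]]
d_rows, d = select_rows(rows, M, 5, 7)
size_answer = sum(map(sum, d))
print(d_rows)
print([[round(v, 2) for v in row] for row in d])
print(round(size_answer, 2))
```

The unrounded total is 19/3, about 6.33; the slide reports 6.34 because it rounds each term to .67 before adding.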
Example: Sort-Merge-Join


answer(X,Z) :- a(X,Y), b(Y,Z)
middle(X,Y,Z) is for the ease of explanation
......
a(4,3)
a(4,4)
……
……
b(3,1)
b(3,5)
b(4,5)
……
middle
(4,3,1)
(4,3,5)
(4,4,5)
……
answer
(4,1)
(4,5)
(4,5)
……
Duplicates!
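The duplicate in this example can be reproduced with the facts listed on the slide. The sketch below uses a plain nested-loop join instead of a sort-merge join, since both produce the same tuples, and it only includes the facts that are actually shown.

```python
# Only the facts shown on the slide; the elided ones are left out.
a = [(4, 3), (4, 4)]
b = [(3, 1), (3, 5), (4, 5)]

# middle(X,Y,Z) keeps the join argument Y for the ease of explanation.
middle = [(x, y, z) for (x, y) in a for (y2, z) in b if y == y2]

# Projecting Y away yields answer(X,Z), possibly with duplicates.
answer = [(x, z) for (x, _y, z) in middle]
distinct = sorted(set(answer))
print(middle)
print(answer)
print(distinct)
```

Two different middle tuples, (4,3,5) and (4,4,5), project to the same answer tuple (4,5), so a size estimate for answer has to account for duplicates.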
SDP for Join by Example


answer(X,Z) :- a(X,Y), b(Y,Z).
Simulate Sort-Merge-Join
A: rows X = (2,4,3), (5,5,1); columns Y = (1,1,1), (2,4,2)
B: rows Y = (1,1,1), (2,4,2); columns Z = (6,8,2), (9,9,1)
Align A and B on Y:
A.X, A.Y, A.Val          B.Y, B.Z, B.Val
(2,4,3), (1,1,1), 2      (1,1,1), (6,8,2), 1
(2,4,3), (2,4,2), 4      (2,4,2), (6,8,2), 3
SDP for Join by Example

answer(X,Z) :- a(X,Y), b(Y,Z).
A.X, A.Y, A.Val          B.Y, B.Z, B.Val
(2,4,3), (1,1,1), 2      (1,1,1), (6,8,2), 1
(2,4,3), (2,4,2), 4      (2,4,2), (6,8,2), 3
Result size of middle(X,Y,Z) can be estimated as
min(A.Y.size,B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)

Examples:


size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1)
size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2)
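The estimate formula above can be evaluated directly for the two examples. The function name `middle_size` and its parameter names are illustrative.

```python
def middle_size(a_y_size, a_val, b_y_size, b_val):
    """min(A.Y.size, B.Y.size) × (A.Val / A.Y.size) × (B.Val / B.Y.size)."""
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


# middle((2,4,3),(1,1,1),(6,8,2)): A.Y.size = 1, A.Val = 2, B.Y.size = 1, B.Val = 1
first = middle_size(1, 2, 1, 1)
# middle((2,4,3),(2,4,2),(6,8,2)): A.Y.size = 2, A.Val = 4, B.Y.size = 2, B.Val = 3
second = middle_size(2, 4, 2, 3)
print(first, second)
```

The two estimates come out as 2 and 6. Read term by term, min(A.Y.size, B.Y.size) bounds the number of shared Y values, and the two ratios are the average numbers of A facts and B facts per Y value.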
SDP for Join by Example


answer(X,Z) :- a(X,Y), b(Y,Z).
Examples:
middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
Duplicates!

Three duplicate handling approaches



Sum: no duplicate removal
Max: most aggressive removal
Expected sum: remove “expected” number of
duplicates
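The three approaches can be compared on the two middle estimates (2 and 6) that project onto the same answer cell ((2,4,3),(6,8,2)). The slides do not give formulas, so everything below is an assumption: "sum" and "max" are read literally, and `combine_expected` is a hypothetical stand-in that treats each contribution as a random subset of the X.size × Z.size possible answer tuples. It is not the formula from the paper.

```python
import math


def combine_sum(contributions):
    """No duplicate removal: every middle tuple yields a new answer tuple."""
    return sum(contributions)


def combine_max(contributions):
    """Most aggressive removal: smaller contributions are all treated as duplicates."""
    return max(contributions)


def combine_expected(contributions, capacity):
    """Hypothetical expected number of distinct tuples among `capacity` possible ones."""
    # Probability that a given answer tuple is produced by none of the contributions,
    # assuming independent, uniformly spread contributions (an assumption).
    miss = math.prod(1 - min(c, capacity) / capacity for c in contributions)
    return capacity * (1 - miss)


contributions = [2.0, 6.0]
capacity = 3 * 2  # X group (2,4,3) has size 3, Z group (6,8,2) has size 2
print(combine_sum(contributions))
print(combine_max(contributions))
print(combine_expected(contributions, capacity))
```

In this example the plain sum, 8, exceeds the 6 possible (X,Z) combinations of the cell, which shows why some form of duplicate removal is needed.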
SDP for Recursive Predicates


Recursive predicates are computed
incrementally until they reach approximate
fixed points
Size reaches an α-approximate fixed point if
Δ(size)/size ≤ α
where
Δ(…) is the difference between two consecutive
iterations in the fixed point computation
0 ≤ α ≤ 1
Example: Recursive Predicates

Transitive closure
path(X,Y) :- edge(X,Y).              (base)
path(X,Y) :- edge(X,Z), path(Z,Y).   (rec)
Computation of the estimate:
1. Compute size(path) and DM(path) using rule base
2. Compute size(path) and DM(path) using rule rec as
in the case of a join
3. If size(path) reaches an approximate fixed point,
stop; otherwise, go to step 2
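The control flow of steps 1 to 3 can be sketched as a loop with the α stopping test. The sketch leaves out the dependency matrices and only tracks the size: `step` stands for one application of rule rec, and `toy_step` is a made-up update rule used purely to exercise the loop. Dividing by the size of the newer iteration is also an assumption, since the slides do not say which iteration supplies the denominator.

```python
def approximate_fixed_point(step, size, alpha, max_iterations=100):
    """Iterate `step` until the relative size change drops to alpha or below."""
    for iteration in range(1, max_iterations + 1):
        new_size = step(size)
        # Δ(size)/size ≤ α, with the newer size as denominator (assumption).
        if new_size == 0 or abs(new_size - size) / new_size <= alpha:
            return new_size, iteration
        size = new_size
    return size, max_iterations


def toy_step(size):
    """Hypothetical update: move halfway towards a limit of 100 in each iteration."""
    return size + (100 - size) / 2


final_size, iterations = approximate_fixed_point(toy_step, 50.0, 0.05)
print(final_size, iterations)
```

A smaller α gives an estimate closer to the exact fixed point at the cost of more iterations, and α = 0 demands that the size stop changing altogether.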
Experimental Studies

Test programs:




Transitive closure
General same generation
Datasets: generated with the Thomas process
and the Matérn cluster process
Results



SDP estimates converge to real sizes for
recursive predicates
Expected sum is good for duplicate removal
Details in the paper
Experimental Studies

[Figure: Transitive Closure; SDP estimates converge to real sizes for recursive predicates]
Experimental Studies

[Figure: Transitive Closure; expected sum is good for duplicate removal]
Conclusion




Dependency matrix for binary predicates
Overcomes problems with argument
independence assumption
SDP for selection, join, and recursion
Experimental validations
Future work




More complex recursions
Negation
Extending SDP to n-ary predicates
Apply cost-based optimization in deductive
systems, such as XSB