Transcript Document
Deriving Predicate Statistics (SDP) in Datalog
Principles and Practice of Declarative Programming
12th International ACM SIGPLAN Symposium
July 26, 2010, Hagenberg, Austria
Senlin Liang and Michael Kifer
Stony Brook University

Summary of Our Approach
  Motivation
    Take advantage of cost-based optimizations in deductive database systems
      Compute cost information (predicate statistics)
      Store and retrieve cost information efficiently
      Apply optimization techniques
  Advantages of our approach
    Keeps argument dependencies
    Handles recursion
    Handles negation

PPDP July 26, 2010 "Deriving Predicate Statistics in Datalog" by Senlin Liang and Michael Kifer

Outline
  Introduction
  SDP
    Traditional approach: histograms + argument independence assumption
      Error grows exponentially
    Dependency matrix stores predicate statistics
    Abstract interpretation of Datalog rules, which are evaluated over dependency matrices
  Experimental studies
  Future work

Histograms
  Data distribution: T = ((v1, f1), ..., (vn, fn)), a list of (value, frequency) pairs
    E.g. ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
  Histograms
    Partition data distribution into groups
    Summarize each group as a bucket: (floor, ceiling, size, count)
    Compute the values and frequencies in each bucket efficiently
  MaxDiff histograms with β buckets
    Partition T using the β-1 largest frequency differences

Example: MaxDiff Histograms (3 buckets)
  1. Partition T using the 2 largest frequency differences
  2. Summarize as (floor, ceiling, size, count)
  3. Value-frequency approximation
     vals(bucket) = [floor, ceiling]; f(val) = count/size, e.g.
     f(7) = 5/3
  T = ((2,1), (3,2), (4,1), (5,3), (6,1), (7,2), (8,2))
  Frequency differences between adjacent values: 1 1 2 2 1 0
  Buckets: (2,4,3,4) (5,5,1,3) (6,8,3,5)

Argument Independence Assumption
  Common in database size estimates
  Data distributions of different arguments are independent of each other
    For example, in predicate p(X,Y), the data distributions of X and Y are independent
  Joint data distribution can be easily computed from individual distributions
    E.g., p(X=a, Y=b) = p(X=a) × p(Y=b)
  Unfortunately, the independence assumption is almost always wrong in real datasets

Example: Histogram + Independence = Poor Estimate
  answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
  Facts: e(2,2), ... as in Example 1 of the paper.
  Histogram buckets of e
    X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
    Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
  Size estimate
    Answer size estimate for each X bucket:
      |[floor, ceiling] ∩ [5,7]| / |[floor, ceiling]| × count
    Summed over the X buckets: size(answer) = 0 + 3 + 3.33 = 6.33

Example: Histogram + Independence = Poor Estimate
  Histogram buckets of e
    X: (2,4,3,4) (5,5,1,3) (6,8,3,5)
    Y: (1,1,1,1) (2,4,3,3) (5,8,4,8)
  Histogram buckets of answer
    X: (5,5,1,3) (6,7,2,3.33)
    Y: (1,1,1,0.53) (2,4,3,1.58) (5,8,4,4.22)
    answer.count = e.count × size(answer)/size(e)
  Real results for answer.Y
    (1,1,1,0) (2,4,3,0) (5,8,4,6)
  Independence causes information loss

Our Approach: Dependency Matrices
  Only considers dependency matrices (DM) for binary predicates
  Partitions facts into local groups
  Sums up the groups into DM values
  Sums up each row/column into (floor, ceiling, size)

Example: DM
  Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
  Partition fact matrix using MaxDiff
  Sum up partitions into matrix values
  [Figure: fact matrix F, its MaxDiff partition, and the summed partition values]

Example: DM
  Fact matrix: F(i,j) = 1 iff p(i,j) is a fact
  Partition fact matrix using MaxDiff
  Sum up partitions into matrix values
  Sum up each row/column into (floor, ceiling, size)
  [Figure: partitioned fact matrix F and the resulting dependency matrix M, with row buckets (2,4,3), (5,5,1), (6,8,3)]

SDP for Selection by Example
  answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
  From the fact matrix, we know that
    size(answer) = ΣF(i,j) for 5 ≤ i ≤ 7
                 = 6
  [Figure: fact matrix F with the rows selected by 5 ≤ X ≤ 7]

SDP for Selection by Example
  answer(X,Y) :- e(X,Y), 5 ≤ X ≤ 7.
  Extract the portions covered by the selection
  Recompute matrix values
  Sum them up as size(answer) = 3 + .67 + .67 + 2 = 6.34
  For each row, recompute (floor, ceiling, size)
  [Figure: dependency matrix M with row buckets (2,4,3), (5,5,1), (6,8,3) and the selected matrix d with row buckets (5,5,1), (6,7,2)]

Example: Sort-Merge-Join
  answer(X,Z) :- a(X,Y), b(Y,Z).
  middle(X,Y,Z) is for the ease of explanation
    a:      ... a(4,3) a(4,4) ...
    b:      ... b(3,1) b(3,5) b(4,5) ...
    middle: (4,3,1) (4,3,5) (4,4,5) ...
    answer: (4,1) (4,5) (4,5) ...
  Duplicates!

SDP for Join by Example
  answer(X,Z) :- a(X,Y), b(Y,Z).
  Simulate sort-merge join: align the Y buckets of A with the Y buckets of B
  [Figure: dependency matrices A and B, aligned on the Y buckets (1,1,1) and (2,4,2)]
    A.X      A.Y      A.Val    B.Y      B.Z      B.Val
    (2,4,3)  (1,1,1)  2        (1,1,1)  (6,8,2)  1
    (2,4,3)  (2,4,2)  4        (2,4,2)  (6,8,2)  3

SDP for Join by Example
  answer(X,Z) :- a(X,Y), b(Y,Z).
    A.X      A.Y      A.Val    B.Y      B.Z      B.Val
    (2,4,3)  (1,1,1)  2        (1,1,1)  (6,8,2)  1
    (2,4,3)  (2,4,2)  4        (2,4,2)  (6,8,2)  3
  Result size of middle(X,Y,Z) can be estimated as
    min(A.Y.size, B.Y.size) × (A.Val/A.Y.size) × (B.Val/B.Y.size)
  Examples:
    size(middle((2,4,3),(1,1,1),(6,8,2))) ~ min(1,1) × (2/1) × (1/1) = 2
    size(middle((2,4,3),(2,4,2),(6,8,2))) ~ min(2,2) × (4/2) × (3/2) = 6

SDP for Join by Example
  answer(X,Z) :- a(X,Y), b(Y,Z).
  Examples:
    middle((2,4,3),(1,1,1),(6,8,2)) → answer((2,4,3),(6,8,2))
    middle((2,4,3),(2,4,2),(6,8,2)) → answer((2,4,3),(6,8,2))
  Duplicates!
  Three duplicate handling approaches
    Sum: no duplicate removal
    Max: most aggressive removal
    Expected sum: remove "expected" number of duplicates

SDP for Recursive Predicates
  Recursive predicates are computed incrementally until they reach approximate fixed points
  Size reaches an α-approximate fixed point if
    Δ(size)/size ≤ α
  where Δ(...) is the difference between two consecutive iterations of the fixed point computation, and 0 ≤ α ≤ 1

Example: Recursive Predicates
  Transitive closure
    path(X,Y) :- edge(X,Y).              (base)
    path(X,Y) :- edge(X,Z), path(Z,Y).   (rec)
  Computation of the estimate:
    1. Compute size(path) and DM(path) using rule base
    2. Compute size(path) and DM(path) using rule rec, as in the case of a join
    3. If size(path) reaches an approximate fixed point, stop; otherwise, go to step 2

Experimental Studies
  Test programs:
    Transitive closure
    General same generation
  Datasets: generated with the Thomas Process and the Matérn Cluster Process
  Results
    SDP estimates converge to real sizes for recursive predicates
    Expected sum is good for duplicate removal
    Details in the paper

Experimental Studies
  SDP estimates converge to real sizes for recursive predicates
  [Figure: Transitive Closure]

Experimental Studies
  Expected sum is good for duplicate removal
  [Figure: Transitive Closure]

Conclusion
  Dependency matrix for binary predicates
    Overcomes problems with the argument independence assumption
  SDP for selection, join, and recursion
  Experimental validations

Future works
  More complex recursions
  Negation
  Extending SDP to n-ary predicates
  Apply cost-based optimization in deductive systems, such as XSB
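
The MaxDiff construction from the "Example: MaxDiff Histograms (3 buckets)" slide can be reproduced with a short Python sketch. The function names and the tie-breaking rule (earlier position wins when two frequency differences are equal) are assumptions made for illustration; the slides do not state how ties are resolved.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, int, int]  # (floor, ceiling, size, count)


def maxdiff_histogram(dist: List[Tuple[int, int]], num_buckets: int) -> List[Bucket]:
    """Split a value-sorted (value, frequency) list at the largest frequency jumps."""
    # Absolute frequency difference between each pair of adjacent values.
    diffs = [abs(dist[i + 1][1] - dist[i][1]) for i in range(len(dist) - 1)]
    # Pick the num_buckets - 1 largest differences; ties go to the earlier
    # position (an assumption, not taken from the slides).
    by_size = sorted(range(len(diffs)), key=lambda i: (-diffs[i], i))
    cuts = sorted(by_size[: num_buckets - 1])

    buckets: List[Bucket] = []
    start = 0
    for cut in cuts + [len(dist) - 1]:
        group = dist[start : cut + 1]
        # size = number of values in the group, count = sum of their frequencies
        buckets.append((group[0][0], group[-1][0], len(group), sum(f for _, f in group)))
        start = cut + 1
    return buckets


def frequency(bucket: Bucket) -> float:
    """Uniform approximation f(val) = count / size for any value in the bucket."""
    _, _, size, count = bucket
    return count / size


T = [(2, 1), (3, 2), (4, 1), (5, 3), (6, 1), (7, 2), (8, 2)]
buckets = maxdiff_histogram(T, 3)
print(buckets)                # [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
print(frequency(buckets[2]))  # 1.6666666666666667, i.e. f(7) = 5/3
```

The output matches the three buckets and the value f(7) = 5/3 shown on the slide.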
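
The numbers on the "Histogram + Independence = Poor Estimate" slides can be checked with the sketch below. It assumes integer-valued arguments, so that the width of a bucket is ceiling - floor + 1; the helper names selection_size and rescale are hypothetical.

```python
from typing import List, Tuple

Bucket = Tuple[int, int, int, float]  # (floor, ceiling, size, count)


def selection_size(buckets: List[Bucket], lo: int, hi: int) -> float:
    """Sum, over all buckets, the share of each bucket that falls inside [lo, hi]."""
    total = 0.0
    for floor, ceiling, _size, count in buckets:
        overlap = min(ceiling, hi) - max(floor, lo) + 1  # integer domain assumed
        if overlap > 0:
            total += overlap / (ceiling - floor + 1) * count
    return total


def rescale(buckets: List[Bucket], answer_size: float) -> List[Bucket]:
    """Independence assumption: scale every bucket by size(answer) / size(e)."""
    e_size = sum(count for _, _, _, count in buckets)
    return [(f, c, s, count * answer_size / e_size) for f, c, s, count in buckets]


e_x = [(2, 4, 3, 4), (5, 5, 1, 3), (6, 8, 3, 5)]
e_y = [(1, 1, 1, 1), (2, 4, 3, 3), (5, 8, 4, 8)]

size_answer = selection_size(e_x, 5, 7)
print(round(size_answer, 2))                                # 6.33
print([round(b[3], 2) for b in rescale(e_y, size_answer)])  # [0.53, 1.58, 4.22]
```

The rescaled Y counts spread the estimate over all three Y buckets, whereas the real result on the slide places all 6 answers in the bucket (5,8,4). This is the information loss that the independence assumption causes.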
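
The join estimate from the "SDP for Join by Example" slides can be sketched as follows. The function name middle_size is hypothetical. Reading "Sum" and "Max" as the sum and the maximum of the estimates that land in the same answer cell is an assumption about the slides' wording; "Expected sum" is left out because the slides do not define it.

```python
def middle_size(a_y_size: int, a_val: float, b_y_size: int, b_val: float) -> float:
    """min(A.Y.size, B.Y.size) × (A.Val / A.Y.size) × (B.Val / B.Y.size)."""
    return min(a_y_size, b_y_size) * (a_val / a_y_size) * (b_val / b_y_size)


# Aligned bucket pairs from the slide, as (A.Y.size, A.Val, B.Y.size, B.Val).
aligned = [
    (1, 2, 1, 1),  # A.Y = (1,1,1), B.Y = (1,1,1)
    (2, 4, 2, 3),  # A.Y = (2,4,2), B.Y = (2,4,2)
]

estimates = [middle_size(*row) for row in aligned]
print(estimates)       # [2.0, 6.0]

# Both estimates land in the same answer cell ((2,4,3), (6,8,2)).
print(sum(estimates))  # 8.0, "Sum": no duplicate removal (assumed reading)
print(max(estimates))  # 6.0, "Max": most aggressive removal (assumed reading)
```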
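
The stopping rule from the "SDP for Recursive Predicates" slide, Δ(size)/size ≤ α, can be illustrated with a generic loop. This is a minimal sketch: the step function stands in for one application of the recursive rule over dependency matrices, toy_step is an invented growth curve rather than data from the paper, and dividing by the size of the newer iteration is an assumption.

```python
from typing import Callable, Tuple


def iterate_to_approx_fixed_point(
    initial_size: float,
    step: Callable[[float], float],
    alpha: float,
    max_iterations: int = 1000,
) -> Tuple[float, int]:
    """Apply step repeatedly until the relative change in size is at most alpha."""
    size = initial_size
    for iteration in range(1, max_iterations + 1):
        new_size = step(size)
        delta = abs(new_size - size)  # Δ(size) between consecutive iterations
        size = new_size
        # Assumption: the denominator is the size after the current iteration.
        if size == 0 or delta / size <= alpha:
            return size, iteration
    raise RuntimeError("no alpha-approximate fixed point within max_iterations")


def toy_step(size: float) -> float:
    """Hypothetical size growth: close half of the gap to 100 in each round."""
    return size + 0.5 * (100.0 - size)


print(iterate_to_approx_fixed_point(10.0, toy_step, alpha=0.05))  # (97.1875, 5)
```

A smaller α requires more iterations before the loop stops, and α = 0 stops only when the size no longer changes between two iterations.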