Minimal Probing: Supporting Expensive Predicates for Top-k Queries
Kevin C. Chang, Seung-won Hwang
Univ. of Illinois at Urbana-Champaign

Context: Top-k Queries

- Ranked queries return the top-k results, unlike Boolean queries.
- Crucial for retrieving data by “soft” conditions:
  - relevance: e.g., text search engines
  - similarity: e.g., multimedia databases
  - preference: e.g., e-commerce product search
- Example scenario: a preference query for finding a house:

    select h.id
    from house h
    where new(age), cheap(price, size), large(size)
    order by min(new, cheap, large)
    stop after 5

  Here new, cheap, and large are the predicates, min is the scoring function, and k = 5 is the retrieval size.
- Observation: it is crucial to support expensive predicates.

Problem: Expensive Predicates

- Expensive predicates:
  - no pre-computed indexes for zero-time sorted access
  - need a probe to evaluate each object (similar to a sequential scan)
- Unified abstraction for:
  - user-defined functions (functional extensibility): query conditions can be arbitrary and user-specific, e.g., cheap(price, size)
  - external predicates (data extensibility): the source interface may require one probe per object, e.g., safe(zip) accesses the crime rate from apbnews.com
  - fuzzy joins: associations of relations can be arbitrary, e.g., close(house.zip, park.zip)

Current Limitations: “Sort-Merge” Framework

- Requires sorted access to the search predicates.
- Sort step: one score-sorted list per predicate. Merge step: a merge algorithm combines the lists into the top-k output.
- Example with F = min(new, cheap, large), k = 1:

    new   (search predicate):     a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
    cheap (expensive predicate):  d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
    large (expensive predicate):  b:0.90, d:0.90, e:0.80, a:0.75, c:0.20
    Top-k output:                 b:0.78

- To “simulate” sorted access on the expensive predicates, complete probing is required. Are these probes necessary?
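
To make the cost of complete probing concrete, here is a minimal sketch that uses the scores from the sort-merge example. It assumes each expensive predicate can be modeled as a dictionary lookup that counts as one probe; the function name `complete_probing_top_k` is hypothetical and not taken from the slides.

```python
# Scores from the sort-merge example. `new` is the search predicate;
# `cheap` and `large` are the expensive predicates (one probe per lookup).
new = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
cheap = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
large = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}


def complete_probing_top_k(search, expensive, k):
    """Probe every expensive predicate on every object, then rank by min."""
    probes = 0
    scores = {}
    for obj, search_score in search.items():
        values = [search_score]
        for predicate in expensive:
            values.append(predicate[obj])  # one probe per (object, predicate)
            probes += 1
        scores[obj] = min(values)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], probes


top, probes = complete_probing_top_k(new, [cheap, large], k=1)
print(top, probes)  # [('b', 0.78)] 10
```

With five objects and two expensive predicates, the baseline spends all ten probes to return a single answer.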
Goal: Minimize probe cost

Motivation: Solution Space

- Assume sequential probing.
- Algorithm skeleton:

    do:
        schedule next obj o, pred p
        probe pr(o, p)
    until (top-k identified)

- (Figure: grid of objects a, b, c against predicates p1, p2, p3.)

Our Framework: Separate, Global Predicate Scheduling

Two important decisions on the framework:

- Separate predicate scheduling
  - scheduling as a separate “optimization” phase before probing
  - avoids run-time scheduling overhead
- Global predicate scheduling
  - scheduling based on global information (predicate selectivities)
  - lack of per-object information to justify per-object scheduling
  - avoids per-object scheduling overhead
- Simple framework and algorithm
  - and efficient!
  - allows essentially an A* framework, for a given predicate schedule
  - enables formal analysis: optimality, scalability

Simple Framework

- Separate, global predicate scheduling, e.g., H = (p1, p2, p3)
- Algorithm skeleton:

    find global schedule H
    do:
        schedule next obj o
        probe pr(o, next(o, H))
    until (top-k identified)

Challenges for Minimizing Probing

- Predicate scheduling before probing: how to identify the best H?
- Object scheduling during probing: how to find the next object to probe, for achieving “minimal probing” with respect to H?
- In the algorithm skeleton, these are the two open steps: "find global schedule H" and "schedule next obj o".

Challenge 1: Object Scheduling

- Goal: perform only necessary probes.
- Necessary probes: a probe is necessary if the top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.
  - Question 1: Given a probe pr(o, next(o,H)), how to determine if it is necessary?
- Probe-optimal algorithm: an algorithm is probe-optimal if it performs only the necessary probes.
  - Question 2: How to identify necessary probes in order to design such an algorithm?

Question 1: Is this Probe Necessary?

k=1, F=min(x,p1,p2); suppose H=(p1,p2)

Case 1: probing b. Maybe not! If the probes on the other objects all return 1, a already scores 0.9 and is the top 1 without any probe on b.

    OID   x     p1   p2   F=min(x,p1,p2)
    a     0.9   1    1    0.9   (top 1)
    b     0.8   ?    ?    0.8   (ceiling)
    c     0.7   1    1    0.7
    d     0.6   1    1    0.6
    e     0.5   1    1    0.5

Case 2: probing a. Necessary! Even when every other object is fully probed, the top 1 cannot be determined without probing a, because a has the highest ceiling score.

    OID   x     p1   p2   F=min(x,p1,p2)
    a     0.9   ?    ?    0.9   (ceiling; top 1?)
    b     0.8   1    1    0.8
    c     0.7   1    1    0.7
    d     0.6   1    1    0.6
    e     0.5   1    1    0.5

Theorem: Probe pr(o,p) is absolutely necessary if o is among the current top-k in terms of ceiling scores.

Question 2: Probe-Optimal Object Scheduling

- Objects in the current top-k must be further probed.
- Probe-optimal object scheduling: Algorithm MPro
  - use a priority queue with ceiling scores as priorities
- Example trace (k=1):

    Probe              Queue after the probe (ceiling scores)
    (initial)          a:0.9   b:0.8   c:0.7  d:0.6  e:0.5
    pr(a,p1) = 0.85    a:0.85  b:0.8   c:0.7  d:0.6  e:0.5
    pr(a,p2) = 0.75    b:0.8   a:0.75  c:0.7  d:0.6  e:0.5
    pr(b,p1) = 0.78    b:0.78  a:0.75  c:0.7  d:0.6  e:0.5
    pr(b,p2) = 0.90    b:0.78  a:0.75  c:0.7  d:0.6  e:0.5
    Top 1: b:0.78

Challenge 2: Predicate Scheduling

- Scheduling problem: find the minimal-cost schedule among all permutations of the predicates.
- Challenges:
  - selectivity estimation: dynamic predicates; aggregate selectivities (context-dependent)
  - scheduling computation: NP-hard
- Our approach:
  - on-line sampling to estimate selectivities
  - greedy selection to schedule predicates
- 0.1% sampling achieves almost the best schedule.

Experiment Results

- Practical performance of MPro:
  - cost proportional to the retrieval size k
  - significant speedup for small k
- Impact of performance factors:
  - database size: sublinear cost scalability
  - score distribution and scoring function: see paper
- (Figure callouts: 6 hour, 2 min.)

Demo: House Search

- Data: all houses on sale in Illinois (N=20990)
  - from www.realtor.com
  - objects: house(id, price, size, bed, bath, zip, city)
- Query: F = Average(n, c, r)
  - n (nearcity): close to Chicago
  - c (cheap): “reasonable” price for its size
  - r (roomy): prefer 4-6 rooms

Summary of Contributions (more in the paper)

- Abstraction: for user-defined, external, and fuzzy join predicates
- Framework and algorithm:
  - sampling-based global scheduling
  - probe-optimal algorithm MPro
  - extensions of MPro: fuzzy joins, parallel MPro, approximation
- Principles/Theorems:
  - necessary-probe principle
  - probe-optimality of MPro
  - analytical scalability of MPro
- Extensive experiments

Thank You!

Parallel MPro: Overview

- Probe-parallel MPro
  - perform k necessary probes concurrently
  - up to k-fold speedup
- Data-parallel MPro
  - partition the data into s chunks
  - up to s-fold speedup
- (Figure: MPro runs on each chunk; a merge step produces the top-k.)

Scalability

(Figure: scalability plots; series N=1000, N=10000, N=100000 and k=100, k=1000, k=10000.)

Comparison

(Figure: time (sec) on a log scale against k; legend: probe, scheduling.)
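
The MPro object-scheduling loop described under "Question 2" can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: predicate scores are assumed to lie in [0, 1] so that an unprobed predicate has ceiling 1.0, the scoring function is assumed monotonic, and the global schedule H is passed in as a plain list of callables. The names `mpro`, `schedule`, and `combine` are hypothetical.

```python
import heapq


def mpro(objects, search_score, schedule, k, combine=min):
    """Top-k by always probing the object with the highest ceiling score.

    objects: iterable of object ids
    search_score: dict id -> score of the (cheap) search predicate
    schedule: list of probe callables id -> score, in global order H
    combine: monotonic scoring function over a list of scores
    Returns (results, probes): top-k (id, score) pairs and the probes made.
    """
    heap = []  # heapq is a min-heap, so ceilings are stored negated
    for o in objects:
        known = [search_score[o]]
        # Ceiling: unprobed predicates are assumed to take their maximum, 1.0.
        ceiling = combine(known + [1.0] * len(schedule))
        heapq.heappush(heap, (-ceiling, o, known))

    results, probes = [], []
    while heap and len(results) < k:
        neg_ceiling, o, known = heapq.heappop(heap)
        n_probed = len(known) - 1
        if n_probed == len(schedule):
            # Fully probed and still on top: its score is final.
            results.append((o, -neg_ceiling))
            continue
        # The top object is incomplete, so its next probe is necessary.
        known = known + [schedule[n_probed](o)]
        probes.append((o, n_probed))
        remaining = len(schedule) - n_probed - 1
        ceiling = combine(known + [1.0] * remaining)
        heapq.heappush(heap, (-ceiling, o, known))
    return results, probes


# Data from the slides: x is the search predicate, p1 and p2 are expensive.
x = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
p1 = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
p2 = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}

results, probes = mpro(x.keys(), x, [p1.__getitem__, p2.__getitem__], k=1)
print(results)  # [('b', 0.78)]
print(probes)   # [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
```

On this data the sketch reproduces the trace from the slides: four probes (two on a, two on b) are enough to confirm b:0.78 as the top 1, and c, d, and e are never probed.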