Minimal Probing: Supporting Expensive Predicates for Top-k Queries Kevin C. Chang Seung-won Hwang Univ. of Illinois at Urbana-Champaign.


Minimal Probing:
Supporting Expensive Predicates for
Top-k Queries
Kevin C. Chang
Seung-won Hwang
Univ. of Illinois at Urbana-Champaign
Context: Top-k Queries


Ranked queries return top-k results, unlike Boolean queries
Crucial for retrieving data by “soft” conditions
– relevance: e.g., text search engines
– similarity: e.g., multimedia databases
– preference: e.g., e-commerce product search
Example scenario: preference query for finding a house:
– select h.id from house h
  where new(age), cheap(price, size), large(size)
  order by min(new,cheap,large) stop after 5
– predicates: new, cheap, large
– scoring function: min(new,cheap,large)
– k: retrieval size (here 5)
Observation: Crucial to support expensive predicates
Problem: Expensive Predicates

Expensive predicates
– no pre-computed indexes for zero-time sorted-access
– need a probe to evaluate each object (similar to sequential scan)
Unified abstraction for:
– user-defined functions: functional extensibility
    – query conditions can be arbitrary, user-specific
    – e.g., cheap(price,size)
– external predicates: data extensibility
    – source interface may require one probe per object
    – e.g., safe(zip) accesses crime rate from apbnews.com
– fuzzy joins
    – associations of relations can be arbitrary
    – e.g., close(house.zip, park.zip)
Current Limitations: “Sort-Merge” Framework

Require sorted access of search predicates.
[Figure: sort step, then merge step, then top-k output; F = min(new,cheap,large), k=1]
Sort step:
– new (search predicate): a:0.90, b:0.80, c:0.70, d:0.60, e:0.50
– cheap (expensive predicate): d:0.90, a:0.85, b:0.78, c:0.75, e:0.70
– large (expensive predicate): b:0.90, d:0.90, e:0.80, a:0.75, c:0.20
Merge step: merge algorithm over the three sorted lists
Top-k output: b:0.78
To “simulate” sorted access, require complete probing
– are these probes necessary?
Goal: Minimize probe cost
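To make the cost of complete probing concrete, here is a minimal Python sketch of the baseline on the slide's five objects. It is not a real merge algorithm: it simply probes every expensive predicate for every object and then ranks. The function name complete_probing_top_k and the dictionary layout are assumptions made for illustration; the scores are the ones on the slide.

```python
# Scores copied from the Sort-Merge slide.
new = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}    # search predicate
cheap = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}  # expensive
large = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}  # expensive


def complete_probing_top_k(k):
    """Hypothetical baseline: probe both expensive predicates on every object."""
    probes = 0
    scores = {}
    for oid in new:
        c = cheap[oid]  # one probe
        l = large[oid]  # one probe
        probes += 2
        scores[oid] = min(new[oid], c, l)  # F = min(new, cheap, large)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], probes


top, probes = complete_probing_top_k(1)
print(top, probes)  # [('b', 0.78)] 10
```

All 10 probes are spent to return a single answer, which is the cost the rest of the talk tries to reduce.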
Motivation: Solution Space

Assume sequential probing:
Algorithm skeleton:
do:
  schedule next obj o, pred p
  probe pr(o,p)
until (top-k identified)
[Figure: grid of objects (a, b, c) against predicates (p1, p2, p3)]
Our framework: Separate, Global Predicate Scheduling
Two important decisions on the framework:
Separate predicate scheduling
– scheduling as a separate “optimization” phase before probing
– avoids run-time scheduling overhead
Global predicate scheduling
– scheduling based on global information (predicate selectivities)
– no per-object information is available to justify per-object scheduling
– avoids per-object scheduling overhead
Simple framework and algorithm, and efficient!
– essentially allows an A* framework, for a given predicate schedule
– enables formal analysis: optimality, scalability
Simple Framework

Separate, global predicate scheduling
predicates H=(p1,p2,p3)
Algorithm skeleton:
find global schedule H
do:
  schedule next obj o
  probe pr(o, next(o,H))
until (top-k identified)
[Figure: grid of objects (a, b, c) against predicates (p1, p2, p3)]
Challenges for Minimizing Probing

Predicate scheduling before probing
– how to identify the best H?
Object scheduling during probing
– how to find the next object to probe, to achieve “minimal probing” with respect to H?
Algorithm skeleton (open steps marked with ?):
find global schedule H          ?
do:
  schedule next obj o           ?
  probe pr(o, next(o,H))
until (top-k identified)
Challenge 1: Object Scheduling

Goal: Perform only necessary probes
Necessary probes:
– A probe is necessary if top-k answers cannot be determined by any algorithm without it, regardless of the outcomes of other probes.
Question 1: Given a probe pr(o, next(o,H)), how to determine if it is necessary?
Probe-optimal algorithm
– An algorithm is probe-optimal if it performs only the necessary probes.
Question 2: How to identify necessary probes in order to design such an algorithm?
Question 1: Is this Probe Necessary?

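k=1, F=min(x,p1,p2); suppose H=(p1,p2)
Unprobed predicates are shown at their maximum value 1; “?” marks the probe in question.

OID  x    p1  p2  F=min(x,p1,p2)
a    0.9  1   1   0.9
b    0.8  ?       ≤ 0.8    Maybe not!
c    0.7  1   1   0.7
d    0.6  1   1   0.6
e    0.5  1   1   0.5

With k=1 only the top 1 is needed, so the probe on b may turn out to be unnecessary.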
Question 1: Is this Probe Necessary?

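k=1, F=min(x,p1,p2); suppose H=(p1,p2)
Unprobed predicates are shown at their maximum value 1; “?” marks the probe in question.

OID  x    p1  p2  F=min(x,p1,p2)
a    0.9  ?       ≤ 0.9    Necessary!
b    0.8  1   1   0.8
c    0.7  1   1   0.7
d    0.6  1   1   0.6
e    0.5  1   1   0.5

Is a the top 1? This cannot be confirmed or ruled out without probing a.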
Theorem: Probe pr(o,p) is absolutely necessary, if o is among
the current top-k in terms of ceiling scores.
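A small Python sketch of the necessary-probe test, under the assumption that scores lie in [0, 1] and the scoring function is monotonic, so every unprobed predicate can be replaced by its maximum value 1 to get a ceiling score. The helper names ceiling_score and necessary_objects are hypothetical, not names from the talk; the numbers come from the example slides.

```python
def ceiling_score(known, n_total, combine=min):
    """Upper bound on the final score: unprobed predicates count as 1.0."""
    return combine(list(known) + [1.0] * (n_total - len(known)))


def necessary_objects(known_by_obj, n_total, k):
    """Objects in the current top-k by ceiling score that still have
    unprobed predicates; their next probe cannot be avoided."""
    ranked = sorted(
        known_by_obj,
        key=lambda o: ceiling_score(known_by_obj[o], n_total),
        reverse=True,
    )
    return [o for o in ranked[:k] if len(known_by_obj[o]) < n_total]


# Only x is known so far; F = min(x, p1, p2) has three inputs in total.
state = {"a": [0.9], "b": [0.8], "c": [0.7], "d": [0.6], "e": [0.5]}
print(necessary_objects(state, n_total=3, k=1))  # ['a']
```

With k=1 only a qualifies at this point, which matches the "Necessary!" verdict for a and the "Maybe not!" verdict for b.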
Question 2: Probe-optimal object scheduling


Objects in current top-k must be further probed
Probe-optimal object scheduling: Algorithm MPro
– use a priority queue with ceiling scores as priorities
Example trace (k=1), queue ordered by ceiling score:
– initial queue: a:0.9, b:0.8, c:0.7, d:0.6, e:0.5
– pr(a,p1)=0.85 gives a:0.85, b:0.8, c:0.7, d:0.6, e:0.5
– pr(a,p2)=0.75 gives b:0.8, a:0.75, c:0.7, d:0.6, e:0.5
– pr(b,p1)=0.78 gives b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
– pr(b,p2)=0.90 gives b:0.78, a:0.75, c:0.7, d:0.6, e:0.5
top 1: b:0.78
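The trace above can be reproduced with a short Python sketch of a ceiling-score priority queue. This is an illustration written from the slide, not the authors' implementation: the function name mpro, its parameters, and the probe callback are assumptions, and the scores for c, d, and e (never probed in the trace) are taken from the earlier Sort-Merge slide.

```python
import heapq


def mpro(x_scores, schedule, probe, k, combine=min):
    """Sketch of probe-driven top-k with ceiling scores as priorities.
    Returns (top-k list of (oid, score), number of probes performed)."""
    n = len(schedule)
    # Heap entry: (-ceiling, oid, index of next predicate, known scores).
    heap = [(-combine([x] + [1.0] * n), oid, 0, [x]) for oid, x in x_scores.items()]
    heapq.heapify(heap)
    results, probes = [], 0
    while heap and len(results) < k:
        neg_ceiling, oid, i, known = heapq.heappop(heap)
        if i == n:
            # Fully probed and still on top: its score is final.
            results.append((oid, -neg_ceiling))
            continue
        known = known + [probe(oid, schedule[i])]  # the necessary probe
        probes += 1
        ceiling = combine(known + [1.0] * (n - i - 1))
        heapq.heappush(heap, (-ceiling, oid, i + 1, known))
    return results, probes


x = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
table = {
    "p1": {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70},
    "p2": {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80},
}
top, probes = mpro(x, ["p1", "p2"], lambda oid, p: table[p][oid], k=1)
print(top, probes)  # [('b', 0.78)] 4
```

The sketch performs the same four probes as the trace and never touches c, d, or e, compared with the ten probes that complete probing would need.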
Challenge 2: Predicate Scheduling

Scheduling problem
– find minimal cost schedule from permutations
Challenges
– selectivity estimation:
    – dynamic predicates
    – aggregate selectivities (context-dependent)
– scheduling computation:
    – NP-hard
Our approach:
– on-line sampling to estimate selectivities
– greedy selection to schedule predicates
0.1% sampling achieves almost the best schedule
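The slide does not spell out the greedy criterion, so the following Python sketch only illustrates the general shape of sampling followed by greedy ordering. The ranking rule used here (expected score reduction per unit probe cost, which suits a min-style scoring function) is an assumption chosen for illustration, not the rule from the paper; greedy_schedule, probe, and cost are hypothetical names.

```python
import random


def greedy_schedule(objects, predicates, probe, cost, sample_rate=0.001, seed=0):
    """Order predicates using scores observed on a small random sample.
    ASSUMED heuristic: prefer predicates that lower ceiling scores the most
    per unit cost, i.e. largest (1 - mean sampled score) / cost first."""
    rng = random.Random(seed)
    n = max(1, round(len(objects) * sample_rate))
    sample = rng.sample(objects, n)  # objects must be a sequence

    def benefit_per_cost(p):
        mean_score = sum(probe(o, p) for o in sample) / len(sample)
        return (1.0 - mean_score) / cost[p]

    return sorted(predicates, key=benefit_per_cost, reverse=True)


# Toy data: "strict" scores low on every object, "lenient" scores high.
objs = list(range(1000))
scores = {"strict": 0.2, "lenient": 0.9}
H = greedy_schedule(objs, ["lenient", "strict"],
                    probe=lambda o, p: scores[p],
                    cost={"strict": 1.0, "lenient": 1.0})
print(H)  # ['strict', 'lenient']
```

With equal costs the low-scoring predicate is probed first because it prunes ceiling scores fastest; raising its cost enough flips the order.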
Experiment Results

Practical performance of MPro
– cost proportional to the retrieval size k
– significant speedup for small k
Impact of performance factors
– database size: sublinear cost scalability
– score distribution and scoring function: see paper
Slide callout: 6 hour, 2 min
Demo : House Search

Data: All houses on sale in Illinois (N=20990)
– from www.realtor.com.
– objects: house(id, price, size, bed, bath, zip, city)
Query: F = Average(n, c, r)
– n nearcity: close to Chicago
– c cheap: “reasonable” price for its size
– r roomy: prefer 4-6 rooms
Summary of Contributions (more in the paper)

Abstraction:
– for user-defined, external, and fuzzy join predicates
Framework and algorithm:
– sampling-based global scheduling
– probe-optimal algorithm MPro
– extensions of MPro: fuzzy joins, parallel MPro, approximation
Principles/Theorems:
– necessary-probe principle
– probe-optimality of MPro
– analytical scalability of MPro
Extensive experiments
Thank You!
Parallel MPro: Overview

Probe-parallel MPro
– Probe k necessary probes concurrently
– Up to k-fold speedup
Data-parallel MPro
– Partition data into s chunks
– Up to s-time speedup
[Figure: several MPro instances feeding a Merge step that outputs the top-k]
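The data-parallel idea (partition, compute a local top-k per chunk, merge) can be sketched in Python as follows. To stay short, the per-chunk step evaluates every object fully with heapq.nlargest as a stand-in for running MPro on that chunk; the function names and the use of a thread pool are assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def local_top_k(chunk, score, k):
    # Stand-in for running MPro on one chunk: full evaluation, then top-k.
    return heapq.nlargest(k, ((score(o), o) for o in chunk))


def data_parallel_top_k(objects, score, k, s):
    """Split objects into s chunks, rank each chunk, merge the partial lists."""
    chunks = [objects[i::s] for i in range(s)]
    with ThreadPoolExecutor(max_workers=s) as pool:
        partial = list(pool.map(lambda c: local_top_k(c, score, k), chunks))
    # Every global top-k object is in its own chunk's local top-k.
    merged = heapq.nlargest(k, (pair for part in partial for pair in part))
    return [(o, sc) for sc, o in merged]


new = {"a": 0.90, "b": 0.80, "c": 0.70, "d": 0.60, "e": 0.50}
cheap = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
large = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}
F = lambda o: min(new[o], cheap[o], large[o])
print(data_parallel_top_k(list("abcde"), F, k=2, s=3))  # [('b', 0.78), ('a', 0.75)]
```

The merge step is correct because any object in the global top-k must also rank within the top-k of the chunk that contains it.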
Scalability
[Figure: scalability plots; panels labelled N=1000, N=10000, N=100000 and k=100, k=1000, k=10000]
Comparison
[Figure: time (sec), log scale from 0.01 to 100, against k from 1 to 100; bars labelled T and O, split into probe and scheduling time]