Transcript Talk Slides

Factorizing Complex Predicates
in Queries to Exploit Indexes
Prasanna Ganesan*
Stanford University
Surajit Chaudhuri
Microsoft Research
Sunita Sarawagi*
IIT Bombay
*Work done at Microsoft Research
Motivation
• Complex, redundant WHERE clauses
– Application-generated decision-support queries
• May result in unsatisfactory plans
– So many candidate plans, so little time
– Conversion to normal form doesn’t work
– Often end up with a table scan
• Goal: Techniques for efficiently identifying
access paths for such complex WHERE clauses
Outline
• Why is the problem challenging?
–
–
–
–
CNF or DNF does not avoid redundancy
Plan space is large
Formal problem statement
Challenges
• Factorization
– Basic factorization: “largest common conjunctive
factor”
– Factorization involving union
• Approximate Factorization
• Experiments
Basic Primitives(Index Intersection)
• SELECT addr FROM consumers
WHERE (income>100000) AND (zipcode=94305)
Lookup(addr)
Index intersect ()
Seek(income>100000)
Seek(zipcode=94305)
Basic Primitives(Index Union)
• SELECT addr FROM consumers
WHERE (income>100000) OR (zipcode=94301)
A
B
Lookup(addr)
Index union()
Seek(income>100000)
A
Seek(zipcode=94301)
B
Index Intersection and Union
• (A AND B) OR (A AND C)  AB+AC
 A(B+C)
Data Lookup


Seek(A)



Seek(B)
Seek(C)
Seek(A)
Index Intersection and Union
• (A OR B) AND (A OR C)  (A+B)(A+C)
 A+BC
Data Lookup


Seek(A)



Seek(B)
Seek(C)
Seek(A)
The Problem
• Given a relation R, find a plan to retrieve
πS(σP(R)) using one or more of
– Table scan
– Index seeks/scans
– Index intersections
– Index unions
– Data lookup from RID lists
• Focus on single-table selection
– Naturally extends to arbitrary queries
Challenges
• Understanding the set of all feasible plans
– Many equivalent rewritings
– Can we rewrite to retrieve a superset?
• Identifying the “best” plan
– Different index characteristics
• Impacts access cost
– Different selectivities
• Impacts intersection/union costs as well
Roadmap
Arbitrary
Expn.
Format
CNF
DNF
Intersection
+One Union
Plan Complexity
Arbitrary
Basic Factorization
{A,C,D}
AND
{A,C}
OR
{A,B,C} AND
A
B
OR
{A,C} AND
C
A
{D,E} AND
C
D
{D}
{D,F} AND
E
D
F
Basic Factorization(Contd.)
• We now have a conjunctive factor
(A AND C AND D)
• Use standard optimizer module to find
plan for this factor
– Table scan, index seek or index intersection
– Typically a greedy algorithm based on index
costs and selectivity
• Evaluate remaining conditions as a filter
Introducing the Union
• If query has conjunctive factors, simple
factorization usually suffices
• Many queries don’t have such factors
– Need to explore index unions
• Consider plans with at most one union
– No index intersection above it
– Sufficient for large set of practical queries
– Limited space allows optimal algorithms
Single-Union Plans ---(1)
• Assume expression in Disjunctive Normal
AC A
Form(DNF)
DG
• E.g. E = ABC+ACD+ADG+DGH
• Consider factorizing E as f.Q+R
– Find intersection plan for f
– Recursively find single-union plan for R
– Merge the two plans (re-use R’s union if it
exists)
Single-Union Plans---(2)
• E=ABC+ACD+ADG+DGH = f.Q+R
• Say f=AC. E=AC(B+D)+ADG+DGH
• Recursively factorize R into DG(A+H)
R
Q
Lookup(Filter E)

Lookup(Filter A+H)
Lookup(Filter C(B+D))
Seek(A)

Seek(D)
Seek(G)
Single-Union Plans ---(3)
• Cost(E)~minE=f.Q+R ( cost(f.Q)+cost(R))
– Natural dynamic-programming formulation
– Real equation slightly more complex
– Cost is exponential
• Use a greedy alternative
– Choose the f that provides greatest cost
reduction without further factorizing R
Other Expression Forms
• Conjunctive Normal Form(CNF)
– Can just use one term
– Multiply terms E.g. (A+B)(A+C) => A+BC
– Recursive algorithm in paper
• General AND-OR trees
– Bottom-up algorithm
– Applies DNF algorithm to OR nodes and CNF
algorithm to AND nodes
Approximate Factoring
• Often, predicates are similar but not
identical
– A(X BETWEEN 1 AND 100) + B(X BETWEEN
10 AND 110)
– Like to exploit similarity of X predicates
• Relax both X predicates to (X BETWEEN
1 AND 110)
– Resulting query is more general (assuming no
NOTty problems)
Challenge
• What predicates do we relax?
– Trade-off between factoring benefit and cost
of false positives
• Rule 1 of relaxation is:
Don’t
irrelevant
predicates
– We
dorelax
not talk
about Fight
Club.
• Find “best” set of range predicates to relax
for each attribute
– Then select the “best” attribute
Finding predicates to relax
• Given expression with range predicates
involving attribute X.
– Find which predicates to relax for greatest
plan improvement.
• Turns out a greedy algorithm is optimal for
many cost functions
– Proof in paper appendix
– Useful as a heuristic even otherwise
Key Idea
• Relax a pair of predicates if computed to
be beneficial
• Repeat treating the relaxed query as the
original query
• Trick is in figuring out when a relaxation is
beneficial
– Original predicates are treated slightly
differently from relaxed predicates
– Details in paper
Experiments
• Experiments on SQL Server 2000
• Factorizing done in stand-alone module
– Did I hear someone say SQL is declarative?
• Queries on UCI Machine Learning and
UCI KDD data.
– Table sizes ~ 1 million rows
• 15 workloads
– Mostly DNF queries (#terms:1 to >100)
Pi
m
a
Pi
m
a2
So
S
yb
hu
ttl
So ea
yb n_L e
ea
a
n_ rge
La
rg
e2
An
ne
Au alU
Ba
s
la tral
nc
ia
e_ n
Sc
al
e
Br
ea
Br st
ea
st
D
ia 2
b
D ete
ia
b s
H ete
yp
s
ot 2
hy
ro
id
L
M
e
us tte
hr r
oo
om
% Reduction in Running Time
Reduction in Running Time
80
70
60
50
40
%Incremental
Reduction(Approx)
%Reduction(Exact)
30
20
10
0
Impact of Factorization
50
45
40
% of Queries
35
30
25
20
15
10
5
0
Total Changed
Plans
Changes from
Changes from
Exact factoring Approx. factoring
Indexintersection
plans
IIU plans
Related Work
• Optimization of complex WHERE clause
– Convert to CNF/DNF [Selinger79, Dayal87]
– Using multiple indexes [Mohan90]
• No factorization
– Using smarter indexes [Leslie95]
• Factorization a popular idea in other
domains
– Compilers [Reinwald66], VLSI Design
[Brayton87]
Conclusion
• Our contributions
– Using factorization to optimize queries
• Efficient algorithms requiring no normalization
• Staged to reduce compile-time overhead
– Introduced approximate factoring
• Algorithm for optimal relaxation
– Integration into overall optimization framework