#### Transcript Lightweight Graphical Models for Selectivity Estimation

Lightweight Graphical Models for
Selectivity Estimation without
Independence Assumptions
Kostas Tzoumas
Amol Deshpande
Christian S. Jensen
Query Optimization
• Need to decide (among others) the
“best” join plan.
• Very simplified picture:
– Cost of a plan = # of intermediate tuples it
produces.
– Use upper plan if |LO|<|OC|, lower plan
otherwise.
• Use size estimates of intermediate
relations.
• Cardinality estimation: Estimating
relation sizes at compile time.
LO
L
cost = |LO|
O
cost = |OC|
C
OC
– Typically using statistical summaries.
• Errors in estimates can lead to wrong
plan. Independence assumptions a
frequent factor of errors.
L
O
C
2
Why Correlations Matter
from
lineitem,orders,customer
where l_orderkey=o_orderkey and
o_custkey=c_custkey and
o_totalprice in [t1,t2] and
l_extendedprice in [e1,e2] and
c_acctbal in [b1,b2]
An order’s total price is a function
of the prices of the order’s items.
Causes LO to have more tuples
than in average.
lineitem
orders
eprice orderkey
orderkey tprice custkey
customer
custkey acctbal
3
Independence Assumption
from
lineitem,orders,customer
where l_orderkey=o_orderkey and
o_custkey=c_custkey and
o_totalprice in [t1,t2] and
l_extendedprice in [e1,e2]
and c_acctbal in [b1,b2]
NLJ
L’O’
HJ
L’
O’
C’
σL
σΟ
σC
|L’| = Pr(eprice in [e1,e2]) |L|
|O’| = Pr(tprice in [t1,t2]) |O|
|L’O’| = Pr(l_orderkey=o_orderkey)|L’||O’|
|L’O’| = Pr(l_orderkey=o_orderkey)
Pr(eprice in [e1,e2])
Pr(tprice in [t1,t2]) |L||O|
• Results to under-estimation of |L’O’| 
wrong query plan, nested loop join.
• Can result in orders of magnitude slower
execution.
• Solution: estimate joint probabilities:
|L’O’| = Pr (l_orderkey=o_orderkey,
eprice in [e1,e2],
tprice in [t1,t2]) |L||O|
L
O
C
4
Problem Setting
lineitem
JLO ≡
L.okey=O.okey
sdate
cdate
rdate
orders
odate
Problem:
J ≡
J ≡
J ≡
Exponential
blowL.pkey=P.pkey
L.skey=S.skey O.ckey=C.ckey
customer
part
up ofsupplier
storage
space
J ≡
S.nkey=C.nkey
and
size
acctbalselectivity
acctbal
estimation time.
LP
OC
LS
SC
Join indicator
A binary random variable
Captures the “event” that
two random tuples join.
We don’t need “key”
attributes in the model
(hard to approximate).
Pr(JOC=T)=sel(O.ckey=C.ckey)
• Database + workload  schema graph
• Schema graph  Set of random variables (descriptive
attributes and join indicators)
• Goal: Approximate P(JLO, JLP, JLS, JSC, JOC, L.sdate, L.cdate,
5
L.rdate, O.odate, P.size, S.acctbal,C.acctbal)
Graphical Models
• Exploit independence and conditional independence
to factor the full joint distribution.
– X┴Y ↔P(X,Y)=P(X)P(Y)
– X┴Z|Y ↔ P(X,Y,Z)=P(X,Y)P(Y,Z)/P(Y)
X
Y
Z
• Bayesian networks
– Graphical representation of a set of cond. inds
– Express a factorization of the joint distribution
• Can be used directly for selectivity estimation
[Getoor’01].
– High dimensional distributions  high overhead
– Construction algorithms do not scale.
6
Fixed Structure
• “Tailored” solution
– 2D histograms only, scalable construction
• Relational independencies:
– L.rdate ┴ O.odate
σL.rdate=a,O.odate=b(LxO)=σL.rdate=a(L) x σO.odate=b(O)
– L.rdate ┴ JSC
– JLO ┴ JSC
• Design decisions
– Join indicator has at most two parents
– BN within a relation a directed tree
7
Bayesian Network
lineitem
rdate
cdate
sdate
part
orders
odate
JLO
JLP
JLS
size
S.acctbal
supplier
JOC
JSC
S.acctbal
customer
8
Moral Graph
rdate
cdate
sdate
odate
JLO
JLP
JLS
size
S.acctbal
JOC
JSC
C.acctbal
9
Chordal Graph
rdate
cdate
sdate
odate
JLO
JLP
JLS
size
S.acctbal
JOC
JSC
C.acctbal
10
Junction Tree
Distributions to be kept =
marginals of “cliques” and
“separators.”
Factorization achieved!
sdate
odate
C.acctbal
odate
sdate
sdate
cdate
odate
sdate
cdate
odate
sdate
JLO
sdate
S.acctbal
JLS
acctbal
sdate
rdate
odate
Efficient selectivity estimation
via junction tree propagation,
or a cdate
custom dynamic odate
odate
C.acctbal
programming
algorithm.
Storage space:
sdate
sdate
• (sdate,rdate)  One 2D histogram
• C.acctbal
(odate,acctbal,JOC)  Two 2D histogramssdate
(JOC=false,true)
size
S.acctbal
• (acctbal,odate,sdate)

Three
1D
histograms
JLP
sdate
Only 2D histograms needed!
JOC
S.acctbal
C.acctbal
JSC
11
Scalable Model Construction
• Create a “local” Bayesian network for each relation R(X,Y,…) by
testing pairwise correlations.
– P(X,Y) = select X,Y,count(*) from R group by X,Y
– Extract the maximum spanning tree using mutual
information I(X;Y) as weight.
• Find the best two predictors of each join indicator JRS.
– P(X,Y,JRS=T) = select
R.X,S.Y,count(*)
from R,S
where R.a=S.a
group by R.X,S.Y
– P(X,Y,JRS=F) = P(X)P(Y) – P(X,Y,JRS=T)
• Very efficient; same complexity as CORDS [Ilyas’04].
12
Selectivity Estimation
sdate
Equivalent: Estimate
Pr(JLO=true,
C.acctbal
JOC=true,
S.acctbal
sdate<=“25/7/2011”,
sdate
C.acctbal<=200000)
sdate
cdate
odate
JLO
sdate
S.acctbal
JLS
odate
sdate
cdate
odate
sdate
C.acctbal
odate
sdate
sdate
odate
from
lineitem,orders,customer
sdatel_orderkey=o_orderkey and
where
rdateo_custkey=c_custkey and
l_sdate<=“25/7/2011” and cdate
sdate
c_acctbal<=200000
odate
odate
C.acctbal
JOC
acctbal
Estimate size of query:
S.acctbal
C.acctbal
JSC
Extract “Steiner tree”:
sdate Minimal subtree that
size
contains JLO, JOC, sdate,
JLP
13
C.acctbal
φ1*(cdate,odate)=
φ1[sdate<=“25/7/2011”]
φ2*(cdate,odate)=
φ2[JLO=true]
φ3*(odate)=
φ3[JOC=true,acctbal<=200000]
2. Multiply
φ12*(cdate,odate)=
φ1*φ2*/μ12
3. Marginalize
φ12**(odate)=
Σcdateφ12*
4. Multiply
φ123*(odate)=
φ12**φ3*/μ23
5. Return
Σodateφ123*
φ2=P(cdate,odate,JLO)
cdate
odate
cdate
odate
JLO
μ23=P(odate)
odate
1. Substitute
Selectivity Estimation
odate
acctbal
JOC
φ3=P(odate,acctbal,JOC)
μ12=P(cdate,odate)
sdate
cdate
odate
φ1=P(sdate,cdate,odate)
Estimate
Pr(JLO=true,
JOC=true,
sdate<=“25/7/2011”,
acctbal<=200000)
Also in paper: dynamic programming
algorithm that minimizes #multiplications by
14
exploiting query optimizer order of requests.
Implementation
• Model construction outside DBMS. Create distributions
with SQL queries.
– Stores junction tree as tables in PostgreSQL catalog.
• Graphical model foundation as PostgreSQL “nodes”.
– Probability distributions stored as equi-width multidimensional
histograms.
– Cliques, multiplication, division, marginalization.
– Junction tree, selectivity estimation algorithms
• Integration with PostgreSQL query optimizer.
– Load Steiner tree from catalog tables.
– Bypass PostgreSQL selectivity estimation functions.
15
Impact of Capturing Correlations
Can do selectivity
estimation efficiently.
Optimization time in
the range of 10s of
milliseconds
Execution & optimization times
from
lineitem,orders,customer
where l_orderkey=o_orderkey and o_custkey=c_custkey
and o_tprice in [t1,t2] and l_eprice in [e1,e2]
and c_acctbal in [b1,b2]
Cost of plans (intermediate tuples)
Avoid under-estimation
by capturing the price
correlation.
Estimates very close to
reality.
Optimizer picks different
plan than default PGSQL.
Huge impact in
execution time.
16
Accuracy of Estimates
Multiplicative error:
max(real,estimate) / min(real,estimate)
Penalizes both under- and over- estimation.
Subset of TPC-H schema.
Tweaked data generator
to introduce correlations.
400 random queries.
Geometric average of
multiplicative error.
One order of magnitude
better estimates for 5join queries
Optimization time less
than 25 milliseconds
17
Conclusions and Future Work
• Attribute value independence assumption unfounded
– Heuristic, not good enough
– Most frequent source of horrible plans
• Practical adaptation of graphical models implemented
in PostgreSQL
– Low overhead, good estimate quality
• Specialized synopses
– Low-dimensional, minimize multiplicative error, error
guarantees after multiplication