Transcript Document

The Query Compiler
16.1 Parsing and Preprocessing
Meghna Jain(205)
Dr. T. Y. Lin
Presentation Outline
16.1 Parsing and Preprocessing
16.1.1 Syntax Analysis and Parse Tree
16.1.2 A Grammar for Simple Subset of SQL
16.1.3 The Preprocessor
16.1.4 Processing Queries Involving Views
Query compilation is divided into three steps:
1. Parsing: parse the SQL query into a parse tree.
2. Logical query plan: transform the parse tree into an expression tree of relational algebra.
3. Physical query plan: transform the logical query plan into a physical query plan, which indicates:
. the operations performed,
. the order of the operations,
. the algorithm used for each operation, and
. the way in which stored data is obtained and passed from one operation to another.
[Figure: from a query to a logical query plan —
Query → Parser → Preprocessor → Logical query plan generator → Query rewrite → Preferred logical query plan]
Syntax Analysis and Parse Tree
The parser takes the SQL query and converts it to a parse tree. The nodes of the parse tree are:
1. Atoms: lexical elements such as keywords, constants, parentheses, operators, and other schema elements.
2. Syntactic categories: names for subparts that play a similar role in a query, such as <Query> and <Condition>.
Grammar for Simple Subset of SQL
<Query> ::= <SFW>
<Query> ::= (<Query>)
<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
<SelList> ::= <Attribute>,<SelList>
<SelList> ::= <Attribute>
<FromList> ::= <Relation>, <FromList>
<FromList> ::= <Relation>
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
<Tuple> ::= <Attribute>
Notation: atoms (constants), <syntactic categories> (variables), ::= (can be expressed/defined as).
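As an illustration (not part of the original slides), the grammar above can be turned into a tiny recursive-descent parser. The following is a hypothetical Python sketch: it handles only the SELECT/FROM fragment (comma-separated <SelList> and <FromList>) and a single "<Attribute> = <Attribute>" condition, returning nested tuples as the parse tree.

```python
# Minimal recursive-descent parser for a fragment of the SQL grammar above.
# Hypothetical sketch: one production per helper function, nested tuples as tree.

def tokenize(s):
    # Split on whitespace, keeping ',' and '=' as separate tokens.
    return s.replace(",", " , ").replace("=", " = ").split()

def parse_query(tokens):
    pos = 0

    def expect(tok):
        nonlocal pos
        assert tokens[pos].upper() == tok, f"expected {tok}, got {tokens[pos]}"
        pos += 1

    def comma_list():
        # <SelList>/<FromList>: a name, optionally followed by ',' and more.
        nonlocal pos
        items = [tokens[pos]]; pos += 1
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            items.append(tokens[pos]); pos += 1
        return items

    expect("SELECT")
    sel_list = comma_list()
    expect("FROM")
    from_list = comma_list()
    expect("WHERE")
    lhs = tokens[pos]; pos += 1
    expect("=")
    rhs = tokens[pos]; pos += 1
    return ("SFW", ("SelList", sel_list), ("FromList", from_list),
            ("Condition", ("=", lhs, rhs)))

tree = parse_query(tokenize("SELECT title FROM StarsIn, MovieStar WHERE starName = name"))
print(tree)
```

A real SQL parser would, of course, cover the full grammar (nested <Query>, IN, LIKE) using the same pattern of one function per syntactic category.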
Query and Parse Tree
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query:
Give titles of movies that have at least one star born in 1960
SELECT title FROM StarsIn WHERE starName IN
(
SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%'
);
An equivalent query:
SELECT title
FROM StarsIn, MovieStar
WHERE starName = name AND
birthdate LIKE '%1960%' ;
Parse Tree
[Parse tree for the equivalent join query, shown here with indentation for children:]
<Query>
 <SFW>
  SELECT <SelList>
   <Attribute> title
  FROM <FromList>
   <RelName> StarsIn
   , <FromList>
    <RelName> MovieStar
  WHERE <Condition>
   <Condition> AND <Condition>
    <Attribute> starName = <Attribute> name
    <Attribute> birthdate LIKE <Pattern> '%1960'
The Preprocessor
Functions of the preprocessor:
. If a relation used in the query is a virtual view, then each use of this relation in the from-list must be replaced by a parse tree that describes the view.
. It is also responsible for semantic checking:
1. Check relation uses: every relation mentioned in a FROM clause must be a relation or view in the current schema.
2. Check and resolve attribute uses: every attribute mentioned in the SELECT or WHERE clause must be an attribute of some relation in the current scope.
3. Check types: all attributes must be of a type appropriate to their uses.
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query:
Give titles of movies that have at least one star born in 1960
SELECT title FROM StarsIn WHERE starName IN
(
SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%'
);
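The first two semantic checks can be sketched against an in-memory schema. This is illustrative Python, not DBMS code; the schema dictionary and the helper name are invented for the example.

```python
# Hypothetical sketch of the preprocessor's semantic checks against a schema.
schema = {
    "StarsIn": ["title", "year", "starName"],
    "MovieStar": ["name", "address", "gender", "birthdate"],
}

def check(from_list, attrs):
    # 1. Check relation uses: every FROM relation must exist in the schema.
    for rel in from_list:
        if rel not in schema:
            raise ValueError(f"unknown relation {rel}")
    # 2. Check and resolve attribute uses: each attribute must belong to
    #    exactly one relation in the current scope.
    resolved = {}
    for a in attrs:
        owners = [r for r in from_list if a in schema[r]]
        if len(owners) != 1:
            raise ValueError(f"attribute {a} is unknown or ambiguous")
        resolved[a] = owners[0]
    return resolved

print(check(["StarsIn", "MovieStar"], ["title", "starName", "name"]))
```

Type checking (the third function) would consult attribute types recorded in the schema in the same way.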
Preprocessing Queries Involving Views
When an operand in a query is a virtual view, the preprocessor needs to replace the operand by a piece of parse tree that represents how the view is constructed from base tables.
Base table: Movies(title, year, length, genre, studioName, producerC#)
View definition:
CREATE VIEW ParamountMovies AS
SELECT title, year FROM Movies
WHERE studioName = 'Paramount';
Example based on view:
SELECT title FROM ParamountMovies WHERE year = 1979;
Query Compiler
[Figure 16.16 & Figure16.18]
CS 257 [section 1]
Aakanksha Pendse.
Roll number : 127
Problem Statement
To verify the claim made in Figure 16.18 using Figure 16.16.
What is Figure 16.16?
. The tree in this diagram represents the query "find movies with movie-stars born in 1960".
. The database tables with their attributes are:
StarsIn(movieTitle, movieYear, starName)
MovieStar(name, address, gender, birthdate)
SQL Query:
SELECT movieTitle
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE '%1960'
);



This diagram (Figure 16.16) is derived from Figure 16.14 by applying the rule that handles a two-argument selection with a condition involving IN. In this query the sub-query is uncorrelated.
[Figure 16.14]
[Figure 16.16: applying the rule for the IN condition]
What is Figure 16.18?
. The tree in this diagram represents the following query: "Find the movies where the average age of the stars was at most 40 when the movie was made."
SQL Query:
SELECT DISTINCT m1.movieTitle, m1.movieYear
FROM StarsIn m1
WHERE m1.movieYear - 40 <=
(SELECT AVG(birthdate) FROM StarsIn m2, MovieStar s
WHERE m2.starName = s.name AND m1.movieTitle = m2.movieTitle
AND m1.movieYear = m2.movieYear);
. This is a correlated sub-query.
Rules for translating a correlated sub-query to relational algebra
. Correlated sub-queries contain unknown values defined outside themselves.
. For this reason, correlated sub-queries cannot be translated in isolation.
. Such sub-queries need to be translated so that they produce a relation in which certain extra attributes appear.
. These attributes must later be compared with the externally defined attributes.
. Conditions that relate attributes from the sub-query to attributes outside are then applied to this relation, and the extra attributes that are no longer necessary can be projected out.
. With this strategy, care must be taken not to form duplicate tuples at the end.
Steps in the formation of the tree in Figure 16.18
. The tree in Figure 16.18 is formed by parsing the query and partially translating it to relational algebra.
. The WHERE clause of the sub-query is split in two; part of it is used to convert the product of relations to an equijoin.
. The aliases m1, m2, and s become nodes of the tree.
Figure 16.18: partially transformed parse tree
16.2 ALGEBRAIC LAWS FOR
IMPROVING QUERY PLANS
Ramya Karri
ID: 206
Optimizing the Logical Query Plan
. The translation rules converting a parse tree to a logical query tree do not always produce the best logical query tree.
. It is often possible to optimize the logical query tree by applying relational algebra laws to convert the original tree into a more efficient logical query tree.
. Optimizing a logical query tree using relational algebra laws is called heuristic optimization.
Relational Algebra Laws
These laws often involve the properties of:
. commutativity - the operator can be applied to its operands independent of order.
  E.g. A + B = B + A - the "+" operator is commutative.
. associativity - the operator is independent of operand grouping.
  E.g. A + (B + C) = (A + B) + C - the "+" operator is associative.
Associative and Commutative Operators
. The relational algebra operators of cross-product (×), join (⋈), union (∪), and intersection (∩) are all associative and commutative.

Commutative          Associative
R × S = S × R        (R × S) × T = R × (S × T)
R ⋈ S = S ⋈ R        (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
R ∪ S = S ∪ R        (R ∪ S) ∪ T = R ∪ (S ∪ T)
R ∩ S = S ∩ R        (R ∩ S) ∩ T = R ∩ (S ∩ T)
Laws Involving Selection
Complex selections involving AND or OR can be broken into two or more selections (splitting laws):
σC1 AND C2(R) = σC1(σC2(R))
σC1 OR C2(R) = σC1(R) ∪S σC2(R)   (where ∪S is set union; this law requires R to be a set)
Example
R = {a,a,b,b,b,c}
p1 satisfied by a,b; p2 satisfied by b,c
σp1 OR p2(R) = {a,a,b,b,b,c}
σp1(R) = {a,a,b,b,b}
σp2(R) = {b,b,b,c}
σp1(R) ∪ σp2(R) = {a,a,b,b,b,c}
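The example above can be checked mechanically by modeling bags as multisets. The following is an illustrative Python sketch (not from the slides) using `collections.Counter`, whose `|` operator is the max-based union the example relies on:

```python
from collections import Counter

R = Counter({"a": 2, "b": 3, "c": 1})          # the bag {a,a,b,b,b,c}
p1 = lambda t: t in ("a", "b")                  # p1 satisfied by a, b
p2 = lambda t: t in ("b", "c")                  # p2 satisfied by b, c

def select(pred, bag):
    # σ_pred(bag): keep every copy of each tuple satisfying pred
    return Counter({t: n for t, n in bag.items() if pred(t)})

lhs = select(lambda t: p1(t) or p2(t), R)       # σ_{p1 OR p2}(R)
rhs = select(p1, R) | select(p2, R)             # max-based (set-style) union
print(lhs == rhs)                               # True on this example
```

Note that bag union (`+` on `Counter`, which adds counts) would count the b's twice, which is why the splitting law for OR is stated with set union.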
Laws Involving Selection (Contd..)
. Selection is pushed through both arguments for union:
σC(R ∪ S) = σC(R) ∪ σC(S)
. Selection is pushed to the first argument and optionally the second for difference:
σC(R - S) = σC(R) - S
σC(R - S) = σC(R) - σC(S)
Laws Involving Selection (Contd..)
. All other operators require selection to be pushed to only one of the arguments.
. For joins, we may not be able to push the selection to both arguments, since an argument may not have the attributes that the selection requires. Provided R has all the attributes mentioned in C:
σC(R × S) = σC(R) × S
σC(R ∩ S) = σC(R) ∩ S
σC(R ⋈ S) = σC(R) ⋈ S
σC(R ⋈D S) = σC(R) ⋈D S
Laws Involving Selection (Contd..)
Example
Consider relations R(a,b) and S(b,c) and the expression
σ(a=1 OR a=3) AND b<c(R ⋈ S)
Three equivalent forms:
σa=1 OR a=3(σb<c(R ⋈ S))
σa=1 OR a=3(R ⋈ σb<c(S))
σa=1 OR a=3(R) ⋈ σb<c(S)
Laws Involving Projection
. Like selections, it is also possible to push projections down the logical query tree. However, the performance gained is less than for selections, because projections just reduce the number of attributes instead of reducing the number of tuples.
Laws Involving Projection

Laws for pushing projections with joins (M and N are the attributes of R and S, respectively, that are needed by L or by the join):
πL(R × S) = πL(πM(R) × πN(S))
πL(R ⋈ S) = πL(πM(R) ⋈ πN(S))
πL(R ⋈D S) = πL(πM(R) ⋈D πN(S))
Laws Involving Projection

Laws for pushing projections with set operations:
. Projection can be performed entirely before bag union:
πL(R ∪B S) = πL(R) ∪B πL(S)
. Projection can be pushed below selection as long as we also keep all attributes needed for the selection (M = L ∪ attr(C)):
πL(σC(R)) = πL(σC(πM(R)))
Laws Involving Join

We have previously seen these important rules
about joins:
1. Joins are commutative and associative.
2. Selection can be distributed into joins.
3. Projection can be distributed into joins.
Laws Involving Duplicate Elimination
. The duplicate elimination operator (δ) can be pushed through many operators, but not all. For example, suppose R has two copies of a tuple t and S has one copy of t. Then, with bag union:
δ(R ∪ S) = one copy of t
δ(R) ∪ δ(S) = two copies of t
Laws Involving Duplicate Elimination
. Laws for pushing the duplicate elimination operator (δ):
δ(R × S) = δ(R) × δ(S)
δ(R ⋈ S) = δ(R) ⋈ δ(S)
δ(R ⋈D S) = δ(R) ⋈D δ(S)
δ(σC(R)) = σC(δ(R))
Laws Involving Duplicate
Elimination

The duplicate elimination operator (δ) can also
be pushed through bag intersection, but not
across union, difference, or projection in
general.
δ(R ∩ S) = δ(R) ∩ δ(S)
Laws Involving Grouping
. The grouping operator (γ) laws depend on the aggregate operators used.
. There is one general rule, however: grouping subsumes duplicate elimination:
δ(γL(R)) = γL(R)
. The reason is that γ produces one tuple per group, so its result has no duplicates. Note, by contrast, that γL(R) = γL(δ(R)) holds only when the aggregate functions in L are unaffected by duplicates (MIN and MAX), while other functions are affected (SUM, COUNT, and AVG).
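The rule δ(γL(R)) = γL(R) is easy to confirm on a small bag: since grouping emits one tuple per group, its output is already duplicate-free. An illustrative Python sketch (the relation and attribute names are invented for the example):

```python
# Check that duplicate elimination after grouping is a no-op.
# Bag of (movieYear, birthdate) tuples, including a duplicate tuple.
R = [(1995, 1960), (1995, 1970), (1995, 1970), (2001, 1980)]

def gamma_max(rows):
    # γ_{movieYear, MAX(birthdate)}: one output tuple per group
    groups = {}
    for year, bd in rows:
        groups[year] = max(groups.get(year, bd), bd)
    return sorted(groups.items())

def delta(rows):
    # δ: duplicate elimination on a bag
    return sorted(set(rows))

g = gamma_max(R)
print(delta(g) == g)   # grouping already produced no duplicates
```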
Query Compiler
By:Payal Gupta
Roll No:106(225)
Professor :Tsau Young Lin
Pushing Selections
. Pushing a selection means replacing the left side of one of the rules above by its right side.
. In pushing selections, we sometimes first move a selection as far up the tree as it will go, and then push the selections down all possible branches.




Let's take an example:
StarsIn(title, year, starName)
Movie(title, year, length, inColor, studioName, producerC#)
Define the view MoviesOf1996 by:
CREATE VIEW MoviesOf1996 AS
SELECT *
FROM Movie
WHERE year = 1996;

"Which stars worked for which studios in 1996?" can be given by the SQL query:
SELECT starName, studioName
FROM MoviesOf1996 NATURAL JOIN StarsIn;

[Logical query plan constructed from the definitions of the query and the view:]
π starName, studioName
 ⋈
  σ year=1996
   Movie
  StarsIn

[Improving the query plan by moving selections up and down the tree:]
π starName, studioName
 ⋈
  σ year=1996
   Movie
  σ year=1996
   StarsIn
Laws Involving Projection
. "Pushing" projections really involves introducing a new projection somewhere below an existing projection.
. A projection keeps the number of tuples the same and only reduces the length of tuples.
. To describe the transformations of extended projection, consider a term E → x on the list for a projection, where E is an attribute or an expression involving attributes and constants.
Example
. Let R(a, b, c) and S(c, d, e) be two relations. Consider the expression πa+e→x, b→y(R ⋈ S). The input attributes of the projection are a, b, and e, and c is the only join attribute. We may apply the law for pushing projections below joins to get the equivalent expression:
πa+e→x, b→y(πa,b,c(R) ⋈ πc,e(S))
. The projection πa,b,c(R) projects onto all the attributes of R, so it is useless; eliminating it gives a third equivalent expression:
πa+e→x, b→y(R ⋈ πc,e(S))
. In addition, we can perform a projection entirely before a bag union. That is:
πL(R ∪B S) = πL(R) ∪B πL(S)
Laws About Joins and Products
. Laws that follow directly from the definition of the join:
R ⋈C S = σC(R × S)
R ⋈ S = πL(σC(R × S)), where C is the condition that equates each pair of attributes from R and S with the same name, and L is a list that includes one attribute from each equated pair and all the other attributes of R and S.
Laws Involving Duplicate Elimination
. The operator δ, which eliminates duplicates from a bag, can be pushed through many but not all operators.
. In general, moving a δ down the tree reduces the size of intermediate relations and may therefore be beneficial.
. Moreover, sometimes we can move δ to a position where it can be eliminated altogether, because it is applied to a relation that is known not to possess duplicates:
δ(R) = R if R has no duplicates. Important cases of such a relation R include:
a) a stored relation with a declared primary key, and
b) a relation that is the result of a γ operation, since grouping creates a relation with no duplicates.






Several laws that "push" δ through other operators are:
δ(R × S) = δ(R) × δ(S)
δ(R ⋈ S) = δ(R) ⋈ δ(S)
δ(R ⋈C S) = δ(R) ⋈C δ(S)
δ(σC(R)) = σC(δ(R))
We can also move the δ to either or both of the arguments of an intersection:
δ(R ∩ S) = δ(R) ∩ S = R ∩ δ(S) = δ(R) ∩ δ(S)
Laws Involving Grouping and Aggregation
. When we consider the operator γ, we find that the applicability of many transformations depends on the details of the aggregate operators used. Thus we cannot state laws in the generality that we used for the other operators. One exception is that a γ absorbs a δ. Precisely:
δ(γL(R)) = γL(R)
. Let us call an operator γL duplicate-impervious if the only aggregations in L are MIN and/or MAX. Then:
γL(R) = γL(δ(R)) provided γL is duplicate-impervious.
Example
. Suppose we have the relations
MovieStar(name, addr, gender, birthdate)
StarsIn(movieTitle, movieYear, starName)
and we want to know for each year the birthdate of the youngest star to appear in a movie that year. We can express this query as:
SELECT movieYear, MAX(birthdate)
FROM MovieStar, StarsIn
WHERE name = starName
GROUP BY movieYear;
[Initial logical query plan for the query:]
γ movieYear, MAX(birthdate)
 σ name = starName
  ×
   MovieStar
   StarsIn

Some transformations that we can apply to this plan are:
1. Combine the selection and product into an equijoin.
2. Generate a δ below the γ, since the γ is duplicate-impervious.
3. Generate a π between the γ and the introduced δ to project onto movieYear and birthdate, the only attributes relevant to the γ.

[Transformed plans:]
γ movieYear, MAX(birthdate)
 π movieYear, birthdate
  δ
   ⋈ name = starName
    MovieStar
    StarsIn

γ movieYear, MAX(birthdate)
 π movieYear, birthdate
  ⋈ name = starName
   δ
    π birthdate, name
     MovieStar
   δ
    π movieYear, starName
     StarsIn
CS 255: Database System
Principles
slides: From Parse Trees to Logical
Query Plans
By:- Arunesh Joshi
Id:-006538558
Agenda
• Conversion to Relational Algebra.
• Removing Subqueries From Conditions.
• Improving the Logical Query Plan.
• Grouping Associative/Commutative Operators.
Parsing
• Goal is to convert a text string containing a
query into a parse tree data structure:
– leaves form the text string (broken into lexical
elements)
– internal nodes are syntactic categories
• Uses standard algorithmic techniques from
compilers
– given a grammar for the language (e.g., SQL),
process the string and build the tree
Example: SQL query
SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE ‘%1960’
);
(Find the movies with stars born in 1960)
Assume we have a simplified grammar for SQL.
Example: Parse Tree
[Parse tree for the query above, shown with indentation for children:]
<Query>
 <SFW>
  SELECT <SelList>
   <Attribute> title
  FROM <FromList>
   <RelName> StarsIn
  WHERE <Condition>
   <Tuple> IN ( <Query> )
    <Tuple>
     <Attribute> starName
    <Query>
     <SFW>
      SELECT <SelList>
       <Attribute> name
      FROM <FromList>
       <RelName> MovieStar
      WHERE <Condition>
       <Attribute> birthdate LIKE '%1960'
The Preprocessor
• It replaces each reference to a view with a
parse (sub)-tree that describes the view (i.e., a
query)
• It does semantic checking:
– are relations and views mentioned in the schema?
– are attributes mentioned in the current scope?
– are attribute types correct?
Convert Parse Tree to
Relational Algebra
• The complete algorithm depends on the specific grammar, which determines the forms of the parse trees
• Here is a flavor of the approach
Conversion
• Suppose there are no subqueries.
• SELECT att-list FROM rel-list WHERE cond
is converted into
PROJ att-list(SELECT cond(PRODUCT(rel-list))), or
π att-list(σ cond(× (rel-list)))
SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name AND birthdate LIKE '%1960';
[Parse tree:]
<Query>
 <SFW>
  SELECT <SelList>
   <Attribute> movieTitle
  FROM <FromList>
   <RelName> StarsIn
   , <FromList>
    <RelName> MovieStar
  WHERE <Condition>
   <Condition> AND <Condition>
    <Attribute> starName = <Attribute> name
    <Attribute> birthdate LIKE '%1960'
Equivalent Algebraic Expression Tree
π movieTitle
 σ starName = name AND birthdate LIKE '%1960'
  ×
   StarsIn
   MovieStar
Handling Subqueries
• Recall the (equivalent) query:
SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE ‘%1960’
);
• Use an intermediate format called two-argument selection
Example: Two-Argument Selection
[Tree: π title over a two-argument σ whose left child is StarsIn and whose right child is the <condition>: <tuple> (<attribute> starName) IN the subquery plan π name(σ birthdate LIKE '%1960'(MovieStar))]
Converting Two-Argument
Selection
• To continue the conversion, we need rules for
replacing two-argument selection with a
relational algebra expression
• Different rules apply depending on the nature of the subquery
• Shown here is an example for the IN operator and an uncorrelated subquery (the subquery computes a relation independent of the tuple being tested)
Rules for IN
[Rule: a two-argument selection over R with <Condition> t IN S is replaced by
σC
 ×
  R
  δ(S)
where C is the condition that equates the attributes in t with the corresponding attributes in S]
Example: Logical Query Plan
π title
 σ starName = name
  ×
   StarsIn
   δ
    π name
     σ birthdate LIKE '%1960'
      MovieStar
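The rule can be traced on toy data: the uncorrelated IN-subquery becomes a δ over the subquery's result, a product with StarsIn, and a selection equating starName with name. An illustrative Python sketch (the tuples and birthdates are invented for the example):

```python
# Decorrelating an uncorrelated IN-subquery, following the rule for IN.
StarsIn = [("King Kong", 1933, "Fay"), ("Alien", 1979, "Sig")]
MovieStar = [("Fay", "LA", "F", "1907"),
             ("Fay", "LA", "F", "1907"),   # duplicate tuple in the bag
             ("Sig", "NY", "F", "1949")]

# Subquery: π_name(σ_{birthdate LIKE '%1907'}(MovieStar))
sub = [name for (name, addr, g, bd) in MovieStar if bd.endswith("1907")]
sub = sorted(set(sub))   # δ: without it, the product would duplicate answers

# π_title(σ_{starName = name}(StarsIn × δ(subquery)))
result = [title for (title, year, star) in StarsIn
          for name in sub if star == name]
print(result)
```

Without the δ, "King Kong" would appear twice, which shows why the rule inserts duplicate elimination over the subquery.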
What if Subquery is
Correlated?
• Example is when subquery refers to the current
tuple of the outer scope that is being tested
• More complicated to deal with, since subquery
cannot be translated in isolation
• Need to incorporate external attributes in the
translation
• Some details are in textbook
Improving the Logical Query
Plan
• There are numerous algebraic laws
concerning relational algebra operations
• By applying them to a logical query plan
judiciously, we can get an equivalent query
plan that can be executed more efficiently
• Next we'll survey some of these laws
Example: Improved Logical Query Plan
π title
 ⋈ starName = name
  StarsIn
  π name
   σ birthdate LIKE '%1960'
    MovieStar
Associative and Commutative Operations
• product
• natural join
• set and bag union
• set and bag intersection
associative: (A op B) op C = A op (B op C)
commutative: A op B = B op A
Laws Involving Selection
• Selections usually reduce the size of the
relation
• Usually good to do selections early, i.e.,
"push them down the tree"
• Also can be helpful to break up a complex
selection into parts
Selection Splitting
• σ C1 AND C2 (R) = σ C1(σ C2(R))
• σ C1 OR C2 (R) = (σ C1(R)) ∪set (σ C2(R)) if R is a set
• σ C1(σ C2(R)) = σ C2(σ C1(R))
Selection and Binary Operators
• Must push selection to both arguments:
– σ C(R ∪ S) = σ C(R) ∪ σ C(S)
• Must push to first arg, optional for 2nd:
– σ C(R - S) = σ C(R) - S
– σ C(R - S) = σ C(R) - σ C(S)
• Push to at least one arg with all attributes mentioned in C:
– product, natural join, theta join, intersection
– e.g., σ C(R × S) = σ C(R) × S, if R has all the attributes in C
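These laws are easy to verify on small sets. An illustrative Python sketch (the relations and condition are invented) checks the union and difference cases:

```python
# Verify selection-pushing laws for union and difference on small sets.
R = {(1, "x"), (2, "y"), (3, "z")}
S = {(2, "y"), (4, "w")}
C = lambda t: t[0] >= 2              # the selection condition

sel = lambda rel: {t for t in rel if C(t)}

# Union: selection must be pushed to both arguments.
assert sel(R | S) == sel(R) | sel(S)

# Difference: pushing to the first argument suffices; the second is optional.
assert sel(R - S) == sel(R) - S == sel(R) - sel(S)

print("selection-pushing laws hold on this example")
```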
Pushing Selection Up the Tree
• Suppose we have relations
– StarsIn(title,year,starName)
– Movie(title,year,len,inColor,studioName)
• and a view
– CREATE VIEW MoviesOf1996 AS
SELECT *
FROM Movie
WHERE year = 1996;
• and the query
– SELECT starName, studioName
FROM MoviesOf1996 NATURAL JOIN StarsIn;
The Straightforward Tree
π starName, studioName
 ⋈
  σ year=1996
   Movie
  StarsIn
Remember the rule σ C(R ⋈ S) = σ C(R) ⋈ S?
The Improved Logical Query Plan
[Three stages:
1. Original: π starName, studioName over the join of σ year=1996(Movie) and StarsIn.
2. Push selection up the tree: π starName, studioName over σ year=1996 over the join of Movie and StarsIn.
3. Push selection down both branches: π starName, studioName over the join of σ year=1996(Movie) and σ year=1996(StarsIn).]
Grouping Assoc/Comm
Operators
• Groups together adjacent joins, adjacent unions, and
adjacent intersections as siblings in the tree
• Sets up the logical QP for future optimization when
physical QP is constructed: determine best order for
doing a sequence of joins (or unions or intersections)
[Figure: a tree of adjacent binary unions over A, B, C, D, E, F is regrouped as a single n-ary union node with children A through F]
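Grouping adjacent associative/commutative operators can be sketched as a tree-flattening pass. The following is illustrative Python; the node representation (nested tuples tagged with the operator symbol) is invented for the example.

```python
# Flatten cascades of the same associative operator into one n-ary node.
# A node is ("U", child, child, ...) or a leaf string such as "A".

def flatten(node):
    if isinstance(node, str):
        return node
    op, *kids = node
    out = [op]
    for k in (flatten(k) for k in kids):
        if isinstance(k, tuple) and k[0] == op:
            out.extend(k[1:])      # splice a same-operator child into parent
        else:
            out.append(k)
    return tuple(out)

# ((A U B) U (C U (D U E))) U F  →  one union with six children
tree = ("U", ("U", ("U", "A", "B"), ("U", "C", ("U", "D", "E"))), "F")
print(flatten(tree))
```

The physical-plan step can then pick the best order for the n-way union (or join) without being constrained by the original binary grouping.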
16.4 Estimating the Cost of Operations
Project Guide
Prepared By
Dr. T. Y. Lin
Akshay Shenoy
Computer Science Dept
San Jose State University
Introduction
. Possible Physical Plan
. Estimating Sizes of Intermediate Relations
. Estimating the Size of a Projection
. Estimating the Size of a Selection
. Estimating the Size of a Join
. Natural Joins With Multiple Join Attributes
. Joins of Many Relations
. Estimating Sizes of Other Operations
Physical Plan
For each physical plan, select:
. An order and grouping for associative-and-commutative operations like joins and unions.
. An algorithm for each operator in the logical plan, e.g., whether a nested-loop join or a hash join is to be used.
. Additional operators that are needed for the physical plan but were not present explicitly in the logical plan, e.g., scanning and sorting.
. The way in which arguments are passed from one operator to the next.
Estimating Sizes of Intermediate Relations
Rules for estimating the number of tuples in an intermediate relation should:
1. Give accurate estimates
2. Be easy to compute
3. Be logically consistent
The objective of estimation is to select the best physical plan with the least cost.
Estimating the Size of a Projection
The projection is different from the other operators,
in that the size of the result is computable. Since a
projection produces a result tuple for every
argument tuple, the only change in the output size
is the change in the lengths of the tuples.
Estimating the Size of a Selection (1)
. Let S = σ A=c(R), where A is an attribute of R and c is a constant. Then we recommend as an estimate:
T(S) = T(R)/V(R,A)
The rule above surely holds if all values of attribute A occur equally often in the database.
Estimating the Size of a Selection (2)
. If S = σ a<c(R), then our estimate for T(S) is: T(S) = T(R)/3
. For a "not equals" comparison S = σ a≠c(R), we may use T(S) = T(R)(V(R,a) - 1)/V(R,a) as an estimate.
. When the selection condition C is the AND of several equalities and inequalities, we can treat the selection σ C(R) as a cascade of simple selections, each of which checks for one of the conditions.
Estimating the Size of a Selection (3)
A less simple, but possibly more accurate, estimate of the size of S = σ C1 OR C2(R) is to assume that C1 and C2 are independent. Then, if R has n tuples, m1 of which satisfy C1 and m2 of which satisfy C2, we would estimate the number of tuples in S as
n(1 - (1 - m1/n)(1 - m2/n))
In explanation, 1 - m1/n is the fraction of tuples that do not satisfy C1, and 1 - m2/n is the fraction that do not satisfy C2. The product of these numbers is the fraction of R's tuples that are not in S, and 1 minus this product is the fraction that are in S.
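The selection estimates above can be collected into one small helper. This is illustrative Python (the function names and the sample statistics are invented); T and V are the statistics defined in this section.

```python
# Size estimates for selections, following the rules in this section.
def est_eq(T_R, V_R_A):       # σ_{A=c}(R): T(R)/V(R,A)
    return T_R / V_R_A

def est_lt(T_R):              # σ_{a<c}(R): the T(R)/3 rule of thumb
    return T_R / 3

def est_or(n, m1, m2):        # σ_{C1 OR C2}(R), assuming C1, C2 independent
    return n * (1 - (1 - m1 / n) * (1 - m2 / n))

T_R, V_R_a = 10000, 50
m1 = est_eq(T_R, V_R_a)       # estimate for tuples satisfying an equality
m2 = est_lt(T_R)              # estimate for tuples satisfying an inequality
print(m1, round(est_or(T_R, m1, m2)))
```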
Estimating the Size of a Join
Two simplifying assumptions:
1. Containment of Value Sets: if R and S are two relations with attribute Y and V(R,Y) <= V(S,Y), then every Y-value of R will be a Y-value of S.
2. Preservation of Value Sets: if we join R with another relation S, an attribute A that appears in R but not in S keeps all of its distinct values: V(S ⋈ R, A) = V(R,A).
Under these assumptions, we estimate:
T(R ⋈ S) = T(R)T(S)/max(V(R,Y), V(S,Y))
Natural Joins With Multiple Join Attributes
Of the T(R)T(S) pairs of tuples from R and S, the expected number of pairs that match in both y1 and y2 is:
T(R)T(S)/(max(V(R,y1), V(S,y1)) × max(V(R,y2), V(S,y2)))
In general, the following rule can be used to estimate the size of a natural join when any number of attributes are shared between the two relations:
. The estimate of the size of R ⋈ S is computed by multiplying T(R) by T(S) and dividing by the largest of V(R,y) and V(S,y) for each attribute y that is common to R and S.
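The rule translates directly into code. An illustrative Python sketch (the statistics dictionaries are invented for the example):

```python
# Estimate T(R ⋈ S): multiply T(R) by T(S), then divide by
# max(V(R,y), V(S,y)) for each attribute y shared by R and S.
def est_join(T_R, V_R, T_S, V_S):
    size = T_R * T_S
    for y in set(V_R) & set(V_S):     # attributes common to R and S
        size /= max(V_R[y], V_S[y])
    return size

# R(a,b): 1000 tuples; S(b,c): 2000 tuples; shared attribute b.
print(est_join(1000, {"a": 50, "b": 100}, 2000, {"b": 200, "c": 40}))
```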
Joins of Many Relations (1)
Rule for estimating the size of any join: start with the product of the number of tuples in each relation. Then, for each attribute A appearing at least twice, divide by all but the least of the V(R,A)'s.
We can also estimate the number of values that will remain for attribute A after the join: by the preservation-of-value-sets assumption, it is the least of these V(R,A)'s.
Joins of Many Relations (2)
Based on the two assumptions (containment and preservation of value sets):
No matter how we group and order the terms in a natural join of n relations, the estimation rules, applied to each join individually, yield the same estimate for the size of the result. Moreover, this estimate is the same one we get if we apply the rule for the join of all n relations as a whole.
Estimating Sizes for Other Operations
. Union: the average of the sum and the larger, i.e., (T(R) + T(S) + max(T(R), T(S)))/2.
. Intersection:
  Approach 1: take the average of the extremes, which is half the smaller: min(T(R), T(S))/2.
  Approach 2: intersection is an extreme case of the natural join, so use the formula
  T(R ⋈ S) = T(R)T(S)/max(V(R,Y), V(S,Y))
Estimating Sizes for Other Operations (contd.)
. Difference: T(R) - (1/2)T(S).
. Duplicate Elimination: take the smaller of (1/2)T(R) and the product of all the V(R, a_i)'s.
. Grouping and Aggregation: upper-bound the number of groups by a product of V(R,A)'s, where attribute A ranges over only the grouping attributes of L. An estimate is the smaller of (1/2)T(R) and this product.
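These rules of thumb are simple enough to state as code. Illustrative Python (the function names are invented):

```python
from math import prod

# Rule-of-thumb size estimates for the remaining operators.
def est_union(T_R, T_S):        # average of the sum and the larger
    return (T_R + T_S + max(T_R, T_S)) / 2

def est_intersection(T_R, T_S): # half the smaller (average of the extremes)
    return min(T_R, T_S) / 2

def est_difference(T_R, T_S):   # T(R) - (1/2) T(S)
    return T_R - T_S / 2

def est_delta(T_R, V_R):        # smaller of T(R)/2 and the product of V's
    return min(T_R / 2, prod(V_R.values()))

print(est_union(1000, 500), est_intersection(1000, 500),
      est_difference(1000, 500), est_delta(1000, {"a": 10, "b": 20}))
```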
16.5 Introduction to Cost-based Plan Selection
Amith KC
Student Id: 109


Whether selecting a logical query plan or constructing a physical query plan from a logical plan, the query optimizer needs to estimate the cost of evaluating certain expressions. We shall assume that the "cost" of evaluating an expression is approximated well by the number of disk I/O's performed.
The number of disk I/O's, in turn, is influenced by:
1. The particular logical operators chosen to implement the query, a matter decided when we choose the logical query plan.
2. The sizes of intermediate results (whose estimation we discussed in Section 16.4).
3. The physical operators used to implement logical operators, e.g., the choice of a one-pass or two-pass join, or the choice to sort or not sort a given relation.
4. The ordering of similar operations, especially joins.
5. The method of passing arguments from one physical operator to the next.
Obtaining Estimates for Size Parameters
The formulas of Section 16.4 were predicated on knowing
certain important parameters, especially T(R), the number of
tuples in a relation R, and V(R, a), the number of different
values in the column of relation R for attribute a.
A modern DBMS generally allows the user or administrator
explicitly to request the gathering of statistics, such as T(R) and
V(R, a). These statistics are then used in subsequent query
optimizations to estimate the cost of operations.
By scanning an entire relation R, it is straightforward to count
the number of tuples T(R) and also to discover the number of
different values V(R, a) for each attribute a.
The number of blocks in which R can fit, B(R), can be
estimated either by counting the actual number of blocks used
(if R is clustered), or by dividing T(R) by the number of tuples
per block
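Computing T(R), V(R, a) for each attribute, and an estimate of B(R) by a single scan can be sketched in a few lines. Illustrative Python; the relation contents and the tuples-per-block figure are invented for the example.

```python
# Gather T(R) and V(R, a) for each attribute a by scanning R once.
R = [("Fay", "F", 1907), ("Sig", "F", 1949), ("Sig", "F", 1949),
     ("Bob", "M", 1907)]
attrs = ("name", "gender", "birthyear")

T = len(R)                                          # T(R): number of tuples
V = {a: len({t[i] for t in R}) for i, a in enumerate(attrs)}
tuples_per_block = 2                                # assumed, to derive B(R)
B = -(-T // tuples_per_block)                       # ceiling division

print(T, V, B)
```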
Computation of Statistics

Periodic re-computation of statistics is the norm in most
DBMS's, for several reasons.



First, statistics tend not to change radically in a short time.
Second, even somewhat inaccurate statistics are useful as long as they
are applied consistently to all the plans.
Third, the alternative of keeping statistics up-to-date can make the
statistics themselves into a "hot-spot" in the database; because statistics
are read frequently, we prefer not to update them frequently too.




The recomputation of statistics might be triggered
automatically after some period of time, or after some number
of updates.
However, a database administrator, noticing that poor-performing query plans are being selected by the query optimizer on a regular basis, might request the recomputation of statistics in an attempt to rectify the problem.
Computing statistics for an entire relation R can be very
expensive, particularly if we compute V(R, a) for each attribute
a in the relation.
One common approach is to compute approximate statistics by
sampling only a fraction of the data. For example, let us
suppose we want to sample a small fraction of the tuples to
obtain an estimate for V(R, a).
Heuristics for Reducing the Cost of Logical Query Plans
One important use of cost estimates for queries or sub-queries
is in the application of heuristic transformations of the query.
We have already observed previously how certain heuristics
applied independent of cost estimates can be expected almost
certainly to improve the cost of a logical query plan.
However, there are other points in the query optimization
process where estimating the cost both before and after a
transformation will allow us to apply a transformation where it
appears to reduce cost and avoid the transformation otherwise.
In particular, when the preferred logical query plan is being
generated, we may consider a number of optional
transformations and the costs before and after.



Because we are estimating the cost of a logical query plan, and have not yet made decisions about the physical operators that will be used to implement the operators of relational algebra, our cost estimate cannot be based on disk I/Os. Rather, we estimate the sizes of all intermediate results using the techniques of Section 16.4, and their sum is our heuristic estimate for the cost of the entire logical plan.
For example, consider the initial logical query plan shown below:

δ
 ⋈
  σ a=10
   R
  S

Let the statistics for the relations R and S be as follows:

R(a, b): T(R) = 5000, V(R, a) = 50, V(R, b) = 100
S(b, c): T(S) = 2000, V(S, b) = 200, V(S, c) = 100

To generate a final logical query plan, we shall insist that the selection be pushed down as far as possible. However, we are not sure whether it makes sense to push the δ below the join or not. Thus, we generate the two query plans shown below; they differ in whether we have chosen to eliminate duplicates before or after the join.

[Plan (a): δ (estimated at 250 tuples) applied to the join (estimated at 1000 tuples) of σ a=10(R) (estimated at 100 tuples; T(R) = 5000) and S (T(S) = 2000).
Plan (b): the δ operators are pushed below the join; δ(σ a=10(R)) is estimated at 50 tuples.]


We know how to estimate the size of the result of the selection: we divide T(R) by V(R, a) = 50.
We also know how to estimate the size of the join: we multiply the sizes of its arguments and divide by max(V(R, b), V(S, b)), which is 200.
Approaches to Enumerating Physical Plans
. Let us consider the use of cost estimates in the conversion of a logical query plan to a physical query plan.
. The baseline approach, called exhaustive, is to consider all combinations of choices (for each of the issues like order of joins, physical implementation of operators, and so on).
. Each possible physical plan is assigned an estimated cost, and the one with the smallest cost is selected.

There are two broad approaches to exploring the space of possible physical plans:
. Top-down: here, we work down the tree of the logical query plan from the root.
. Bottom-up: for each sub-expression of the logical-query-plan tree, we compute the costs of all possible ways to compute that sub-expression. The possibilities and costs for a sub-expression E are computed by considering the options for the sub-expressions of E, and combining them in all possible ways with implementations for the root operator of E.
Branch-and-Bound Plan Enumeration


This approach, often used in practice, begins by using
heuristics to find a good physical plan for the entire logical
query plan. Let the cost of this plan be C. Then as we consider
other plans for sub-queries, we can eliminate any plan for a
sub-query that has a cost greater than C, since that plan for the
sub-query could not possibly participate in a plan for the
complete query that is better than what we already know.
Likewise, if we construct a plan for the complete query that has
cost less than C, we replace C by the cost of this better plan in
subsequent exploration of the space of physical query plans.
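The pruning idea can be sketched in a few lines; the plan representation and the costs below are placeholders, not part of the book's presentation.

```python
# Branch-and-bound pruning sketch: C is the cost of a heuristic plan;
# any candidate whose estimated cost exceeds C is discarded, and C is
# tightened whenever a cheaper complete plan is found.
def branch_and_bound(candidates, heuristic_cost):
    best_cost = heuristic_cost   # the bound C
    best_plan = None             # None means "keep the heuristic plan"
    for plan, cost in candidates:
        if cost > best_cost:     # prune: cannot beat what we already have
            continue
        best_plan, best_cost = plan, cost   # replace C by the better cost
    return best_plan, best_cost

plan, cost = branch_and_bound([("p1", 120), ("p2", 80), ("p3", 95)], 100)
print(plan, cost)   # p2 80
```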
Hill Climbing



In this approach we really search for a “valley” in the
space of physical plans and their costs; we start with a
heuristically selected physical plan.
We can then make small changes to the plan, e.g., replacing
one method for an operator by another, or reordering joins by
using the associative and/or commutative laws, to find "nearby"
plans that have lower cost.
When we find a plan such that no small modification yields a
plan of lower cost, we make that plan our chosen physical
query plan.
Dynamic Programming


In this variation of the general bottom-up strategy, we keep for
each sub-expression only the plan of least cost.
As we work up the tree, we consider possible implementations
of each node, assuming the best plan for each sub-expression is
also used.
Selinger-Style Optimization

This approach improves upon the dynamic-programming
approach by keeping for each sub-expression not only the plan
of least cost, but certain other plans that have higher cost, yet
produce a result that is sorted in an order that may be useful
higher up in the expression tree. Examples of such interesting
orders are when the result of the sub-expression is sorted on
one of:

The attribute(s) specified in a sort (τ) operator at the root

The grouping attribute(s) of a later group-by (γ) operator.

The join attribute(s) of a later join.
Choosing an Order for Joins
Chapter 16.6 by:
Chiu Luk
ID: 210
Introduction

This section focuses on a critical problem in
cost-based optimization:
selecting an order for the natural join of three or more
relations.
Compared to other binary operations, joins take
more time and therefore need effective
optimization techniques.
Significance of Left and Right
Join Arguments


The argument relations in joins determine the
cost of the join.
The left argument of the join is:
• called the build relation,
• assumed to be smaller,
• stored in main memory.
Significance of Left and Right
Join Arguments

The right argument of the join is:
• called the probe relation,
• read a block at a time,
• its tuples are matched with those of the build relation.
The join algorithms which distinguish between
the arguments are:
• one-pass join,
• nested-loop join,
• index join.
Join Trees




Order of arguments is important for joining two
relations.
The left argument, since it is stored in main memory,
should be smaller.
With two relations there are only two choices of join tree.
With more than two relations, there are n! ways
to order the arguments and therefore n! join
trees, where n is the number of relations.
Join Trees

The total number of tree shapes T(n) for n relations is given
by the recurrence
T(n) = Σ i=1..n−1 T(i) · T(n − i), with
T(1) = 1
T(2) = 1
T(3) = 2
T(4) = 5 … etc
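The values above follow the Catalan-number recurrence T(n) = Σ T(i)·T(n−i), which can be checked with a few lines of Python:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tree_shapes(n: int) -> int:
    """T(n): number of binary tree shapes with n leaves (relations).
    T(1) = 1; T(n) = sum of T(i) * T(n - i) for i = 1 .. n-1."""
    if n == 1:
        return 1
    return sum(tree_shapes(i) * tree_shapes(n - i) for i in range(1, n))

print([tree_shapes(n) for n in range(1, 5)])   # [1, 1, 2, 5]
```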
Left-Deep Join Trees

Consider 4 relations. The different ways to join
them are as follows:
• In fig (a) all the right children are leaves. This
is a left-deep tree.
• In fig (c) all the left children are leaves. This is
a right-deep tree.
• Fig (b) is a bushy tree.
Considering only left-deep trees is advantageous when
deciding join orders.
Join order
Join-order selection over left-deep join trees:
((…((A1 ⋈ A2) ⋈ A3) …) ⋈ An)
Dynamic programming:
• The best plan is computed for each subset of the relations.
• BestPlan(A1, …, An) = the minimum-cost plan among:
  BestPlan(A2, …, An) ⋈ A1
  BestPlan(A1, A3, …, An) ⋈ A2
  …
  BestPlan(A1, …, An−1) ⋈ An
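The recurrence above can be turned into a small dynamic program over subsets of relations. The cost model here (sum of intermediate-result sizes, with a fixed toy selectivity of 1/10) is only a stand-in for real statistics-based estimates:

```python
from itertools import combinations

def best_left_deep_order(sizes):
    """DP over subsets, mirroring the recurrence above:
    BestPlan(S) = min over r in S of BestPlan(S - {r}) joined with r.
    Cost = sum of intermediate-result sizes; result sizes use a toy
    selectivity of 1/10 in place of real statistics."""
    rels = tuple(sorted(sizes))
    best = {frozenset([r]): (0, (r,), sizes[r]) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(rels, k):
            s = frozenset(subset)
            choices = []
            for r in subset:  # r is joined last (rightmost in the spine)
                cost, order, size = best[s - {r}]
                out = size * sizes[r] // 10
                choices.append((cost + out, order + (r,), out))
            best[s] = min(choices)
    cost, order, _ = best[frozenset(rels)]
    return cost, order

print(best_left_deep_order({"R": 100, "S": 20, "T": 5}))  # (110, ('S', 'T', 'R'))
```

With these toy sizes, joining the two small relations first (S ⋈ T, then R) minimizes the total size of intermediate results.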
Dynamic Programming to Select
a Join Order and Grouping

Three choices to pick an order for the join of many relations
are:
• consider all of the relations,
• consider a subset,
• use a heuristic to pick one.
Dynamic programming is used either to consider all or a subset:
• construct a table of costs based on relation sizes,
• remember only the minimum-cost entry, which is all that is
required to proceed.
A Greedy Algorithm for Selecting
a Join Order



It is expensive to use an exhaustive method like
dynamic programming.
A better approach is to use a join-order heuristic
for query optimization.
A greedy algorithm is an example of this:
• make one decision at a time about the order of the joins, and
never backtrack on a decision once made.
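A minimal sketch of that greedy idea, again with a toy size estimate standing in for a real cost model:

```python
def greedy_join_order(sizes):
    """Greedy heuristic: start with the pair whose estimated join result
    is smallest, then repeatedly join in the relation that keeps the
    running intermediate result smallest. No backtracking."""
    est = lambda a, b: a * b // 10   # toy join-size estimate
    names = sorted(sizes)
    # one decision: the cheapest starting pair
    _, a, b = min((est(sizes[x], sizes[y]), x, y)
                  for i, x in enumerate(names) for y in names[i + 1:])
    order, size = [a, b], est(sizes[a], sizes[b])
    remaining = set(names) - {a, b}
    while remaining:                 # one decision per step, never undone
        _, r = min((est(size, sizes[n]), n) for n in remaining)
        order.append(r)
        size = est(size, sizes[r])
        remaining.remove(r)
    return order

print(greedy_join_order({"R": 100, "S": 20, "T": 5, "U": 50}))
# ['S', 'T', 'U', 'R']
```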
UPDATES
Reducing Cost by Heuristics (applies to the logical query plan):
• Estimate the cost before and after a transformation.
• Only apply the transformation when the cost estimates show
it is beneficial.
JOIN TREES:
• Order of arguments is important for joining two
relations.
• The left argument, since it is stored in main memory,
should be smaller.
Completing the Physical-Query-Plan and Chapter 16 Summary
(16.7-16.8)
CS257 Spring 2009
Professor Tsau Lin
Student: Suntorn Sae-Eung
Donavon Norwood
Outline
16.7 Completing the Physical-Query-Plan
I. Choosing a Selection Method
II. Choosing a Join Method
III. Pipelining Versus Materialization
IV. Pipelining Unary Operations
V. Pipelining Binary Operations
VI. Notation for Physical Query Plan
VII. Ordering the Physical Operations
16.8 Summary of Chapter 16
Before Completing the Physical-Query-Plan
A query previously has been:
• parsed and preprocessed (16.1),
• converted to logical query plans (16.3),
• its operation costs estimated (16.4),
• its costs determined by cost-based plan selection (16.5),
• its join costs weighed by choosing an order for joins (16.6).
16.7 Completing the Physical-Query-Plan
Three topics related to turning the logical plan into a complete
physical plan:
1. Choosing physical implementations, such as selection and
join methods.
2. Decisions regarding intermediate results
(materialized or pipelined).
3. Notation for physical-query-plan operators.
I. Choosing a Selection Method (A)
Algorithms for each selection operator:
1. Can we use an existing index on an attribute?
If yes, use an index-scan; otherwise, a table-scan.
2. After retrieving all tuples satisfying the indexed condition in
(1), filter them with the remaining selection conditions.
Choosing a Selection Method (A)
(cont.)
Recall: cost of a query = # disk I/O's.
How costs for various plans are estimated for a σC(R) operation:
1. Cost of the table-scan algorithm:
   a) B(R) if R is clustered
   b) T(R) if R is not clustered
2. Cost of a plan picking an equality term (e.g. a = 10) with index-scan:
   a) B(R) / V(R, a) with a clustering index
   b) T(R) / V(R, a) with a nonclustering index
3. Cost of a plan picking an inequality term (e.g. b < 20) with index-scan:
   a) B(R) / 3 with a clustering index
   b) T(R) / 3 with a nonclustering index
Example
Selection: σx=1 AND y=2 AND z<5 (R)
Where the parameters of R(x, y, z) are:
T(R) = 5000, B(R) = 200, V(R, x) = 100, and V(R, y) = 500.
Relation R is clustered.
x and y have nonclustering indexes; only the index on z is
clustering.
Example (cont.)
Selection options:
1. Table-scan, then filter on x, y, z. Cost is B(R) = 200 since R is
clustered.
2. Use the index on x = 1, then filter on y, z. Cost is 50, since
T(R) / V(R, x) = 5000/100 = 50 tuples and the index is not
clustering.
3. Use the index on y = 2, then filter on x, z. Cost is 10, since
T(R) / V(R, y) = 5000/500 = 10 tuples using the
nonclustering index.
4. Index-scan on the clustering index with z < 5, then filter on x, y.
Cost is about B(R)/3 ≈ 67.
Example (cont.)
Costs:
• option 1 = 200
• option 2 = 50
• option 3 = 10 ← lowest
• option 4 = 67
The lowest cost is option 3. Therefore the preferred physical plan:
1. retrieves all tuples with y = 2,
2. then filters on the remaining two conditions (x and z).
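The four option costs can be recomputed directly from the cost table above; this is a sketch, and the variable names are mine.

```python
# Parameters from the example: R(x, y, z), R clustered,
# nonclustering indexes on x and y, clustering index on z.
T_R, B_R = 5000, 200
V_x, V_y = 100, 500

costs = {
    1: B_R,          # table-scan: B(R), since R is clustered
    2: T_R // V_x,   # index on x = 1: T(R)/V(R, x), nonclustering
    3: T_R // V_y,   # index on y = 2: T(R)/V(R, y), nonclustering
    4: B_R // 3,     # clustering index on z < 5: about B(R)/3
}
best = min(costs, key=costs.get)
print(costs, "-> best option:", best)
# {1: 200, 2: 50, 3: 10, 4: 66} -> best option: 3
```

Integer division gives 66 for option 4 where the slides round to ≈ 67; either way, option 3 wins.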
II. Choosing a Join Method
Determine the costs associated with each join algorithm:
1. One-pass join and nested-loop join, when enough buffers can
be devoted to the join.
2. Sort-join is preferred when the attributes are pre-sorted, or when
two or more joins are on the same attribute, such as
(R(a, b) ⋈ S(a, c)) ⋈ T(a, d),
where sorting R and S on a produces the result of R ⋈ S
sorted on a, which can be used directly in the next join.
Choosing a Join Method (cont.)
3. Index-join, for a join with a high chance of using an
index created on the join attribute, such as R(a, b) ⋈ S(b, c).
4. Hash-join is the best choice for unsorted, non-indexed
relations that need a multipass join.
III. Pipelining Versus
Materialization
Materialization (the naïve way):
• store the (intermediate) result of each operation on disk.
Pipelining (the more efficient way):
• interleave the execution of several operations; the tuples produced by
one operation are passed directly to the operations that use them;
• store the (intermediate) result of each operation in buffers
implemented in main memory.
IV. Pipelining Unary
Operations
• Unary operations work a tuple at a time or on a full relation;
selection and projection are the best candidates for pipelining.
[Figure: R is read through an input buffer into a unary operation,
whose output buffer feeds the input buffer of the next unary
operation; M−1 buffers are available.]
• Pipelined unary operations are implemented by
iterators.
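Iterator-based pipelining can be sketched with Python generators standing in for the open/getNext/close iterator interface; the relations and operator names here are illustrative.

```python
# Each operator consumes tuples from its child one at a time, so no
# intermediate result is ever materialized.
def table_scan(rows):
    for row in rows:            # leaf operator: produce R a tuple at a time
        yield row

def select(pred, child):
    for row in child:           # selection: pass through qualifying tuples
        if pred(row):
            yield row

def project(cols, child):
    for row in child:           # projection: keep only the named fields
        yield {c: row[c] for c in cols}

R = [{"a": 1, "b": 2}, {"a": 10, "b": 3}, {"a": 10, "b": 4}]
plan = project(["b"], select(lambda r: r["a"] == 10, table_scan(R)))
print(list(plan))               # [{'b': 3}, {'b': 4}]
```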
V. Pipelining Binary
Operations
• Binary operations: ∪, ∩, −, ⋈, ×
• The results of binary operations can also be
pipelined.
• Use one buffer to pass the result to its consumer,
one block at a time.
• The extended example shows the tradeoffs and
opportunities.
Example
Consider a physical query plan for the expression
(R(w, x) ⋈ S(x, y)) ⋈ U(y, z)
Assumptions:
• R occupies 5,000 blocks; S and U each occupy 10,000 blocks.
• The intermediate result R ⋈ S occupies k blocks for some k.
• Both joins will be implemented as hash-joins, either one-pass
or two-pass depending on k.
• There are 101 buffers available.
Example (cont.)
First consider the join R ⋈ S; neither relation fits in the buffers.
• We need a two-pass hash-join to partition R into 100 buckets
(the maximum possible); each bucket has 50 blocks.
• The 2nd pass of the hash-join uses 51 buffers, leaving the
remaining 50 buffers for joining the result of R ⋈ S with U.
Example (cont.)
Case 1: suppose k ≤ 49, so the result of R ⋈ S
occupies at most 49 blocks. Steps:
1. Pipeline R ⋈ S into 49 buffers.
2. Organize them for lookup as a hash table.
3. Use the one buffer left to read each block of U in turn.
4. Execute the second join as a one-pass join.
Example (cont.)
The total number of disk I/O's is 55,000:
• 45,000 for the two-pass hash-join of R and S
• 10,000 to read U for the one-pass hash-join of
(R ⋈ S) ⋈ U.
Example (cont.)
Case 2: suppose k > 49 but k ≤ 5,000. We can still
pipeline, but we need another strategy, in which the
intermediate result is joined with U in a 50-bucket,
two-pass hash-join. Steps:
1. Before starting on R ⋈ S, hash U into 50 buckets of 200
blocks each.
2. Perform the two-pass hash-join of R and S using 51 buffers as in
case 1, placing the results in the 50 remaining buffers to form 50
buckets for the join of R ⋈ S with U.
3. Finally, join R ⋈ S with U, bucket by bucket.
Example (cont.)
The number of disk I/O's is:
• 20,000 to read U and write its tuples into buckets
• 45,000 for the two-pass hash-join of R and S
• k to write out the buckets of R ⋈ S
• k + 10,000 to read the buckets of R ⋈ S and of U in the
final join
The total cost is 75,000 + 2k.
Example (cont.)
Compare the increase in I/O's between case 1 and case 2:
• k ≤ 49 (case 1): disk I/O's = 55,000
• 50 ≤ k ≤ 5,000 (case 2):
  k = 50: I/O's = 75,000 + (2 × 50) = 75,100
  k = 51: I/O's = 75,000 + (2 × 51) = 75,102
  k = 52: I/O's = 75,000 + (2 × 52) = 75,104
Notice: the I/O cost jumps discontinuously as k increases from
49 to 50.
Example (cont.)
Case 3: k > 5,000. We cannot perform the two-pass
join in the 50 available buffers if the result of R ⋈ S
is pipelined. Steps:
1. Compute R ⋈ S using a two-pass join and store the
result on disk.
2. Join the result of (1) with U, using another two-pass join.
Example (cont.)
The number of disk I/O's is:
• 45,000 for the two-pass hash-join of R and S
• k to store R ⋈ S on disk
• 30,000 + 3k for the two-pass join of U with R ⋈ S
The total cost is 75,000 + 4k.
Example (cont.)
In summary, the cost of the physical plan as a function of the
size k of R ⋈ S:
• k ≤ 49: 55,000 disk I/O's
• 49 < k ≤ 5,000: 75,000 + 2k
• k > 5,000: 75,000 + 4k
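The three cases collapse into one cost function of k. This is a direct transcription of the example, with the case-2 boundary taken as k ≤ 5,000:

```python
def total_io(k: int) -> int:
    """Disk I/O's for (R join S) join U as a function of k = B(R join S)."""
    if k <= 49:
        return 55_000             # case 1: one-pass second join
    if k <= 5_000:
        return 75_000 + 2 * k     # case 2: 50-bucket two-pass second join
    return 75_000 + 4 * k         # case 3: materialize R join S on disk

print(total_io(49), total_io(50), total_io(6_000))   # 55000 75100 99000
```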
VI. Notation for Physical Query
Plans
Several types of operators:
1. Operators for leaves
2. (Physical) operators for selection
3. (Physical) sort operators
4. Other relational-algebra operations
In practice, each DBMS uses its own internal
notation for physical query plans.
Notation for Physical Query Plans
(cont.)
1. Operators for leaves
A leaf operand R is replaced in the LQP tree by one of:
• TableScan(R): read all blocks of R.
• SortScan(R, L): read R in order according to the attribute list L.
• IndexScan(R, C): use an index on attribute A to retrieve tuples
satisfying condition C of the form Aθc.
• IndexScan(R, A): scan the index on attribute R.A. This behaves
like TableScan but is more efficient if R is not clustered.
Notation for Physical Query Plans
(cont.)
2. (Physical) operators for selection
The logical operator σC(R) is often combined with access methods:
• If σC(R) is replaced by Filter(C), and there is no index on R or
on an attribute of condition C, use TableScan(R) or
SortScan(R, L) to access R.
• If condition C can be written as Aθc AND D for some condition D,
and there is an index on R.A, then we may
use the operator IndexScan(R, Aθc) to access R, and
use Filter(D) in place of the selection σC(R).
Notation for Physical Query Plans
(cont.)
3. (Physical) sort operators
• Sorting can occur at any point in the physical plan, using
the notation SortScan(R, L).
• It is common to use an explicit operator Sort(L) to sort a
relation that is not stored.
• Sort(L) can be applied at the top of the physical-query-plan
tree if the result needs to be sorted to satisfy an ORDER BY
clause (τ).
Notation for Physical Query Plans
(cont.)
4. Other relational-algebra operations
Descriptive text and annotations elaborate on:
• the operation performed, e.g. join or grouping;
• necessary parameters, e.g. the condition of a theta-join or the
list of elements in a grouping;
• a general strategy for the algorithm, e.g. sort-based, hash-based,
or index-based;
• a decision about the number of passes to be used, e.g. one-pass,
two-pass, or multipass;
• the anticipated number of buffers the operation will require.
Notation for Physical Query Plans
(cont.)
Example of a physical query plan:
the plan in example 16.36 for the case k > 5,000 uses
• TableScan
• Two-pass hash-join
• Materialize (double line)
• Store operator
Notation for Physical Query Plans
(cont.)
Another example:
the plan in example 16.36 for the case k ≤ 49 uses
• TableScan
• (2) Two-pass hash-join
• Pipelining
• Different buffer needs
• Store operator
Notation for Physical Query Plans
(cont.)
The physical query plan in example 16.35:
• uses the index on the condition y = 2 first,
• then filters with the remaining conditions.
VII. Ordering of Physical
Operations
• The PQP is represented as a tree structure, which implies an
order of operations.
• Still, the order of evaluation of interior nodes may not always
be clear:
• iterators are used in a pipelined manner;
• since the execution of various nodes overlaps in time, a strict
“ordering” makes no sense.
Ordering of Physical Operations
(cont.)
Three rules summarize the ordering of events in a PQP tree:
1. Break the tree into sub-trees at each edge that represents
materialization. Execute one sub-tree at a time.
2. Order the execution of the sub-trees bottom-up and
left-to-right.
3. All nodes of each sub-tree are executed simultaneously.
Summary of Chapter 16
In this part of the presentation I will talk about
the main topics of Chapter 16.
COMPILATION OF QUERIES
• Compilation means turning a query into a physical query plan,
which can be implemented by the query engine.
• Steps of query compilation:
  • parsing,
  • semantic checking,
  • selection of the preferred logical query plan,
  • generating the best physical plan.
THE PARSER
• The first step of SQL query processing.
• Generates a parse tree.
• Nodes in the parse tree correspond to SQL constructs.
• Similar to the compiler of a programming language.
VIEW EXPANSION
• A very critical part of query compilation.
• Expands the view references in the query tree to the
definition of the view.
• Provides opportunities for query optimization.
SEMANTIC CHECKING
• Checks the semantics of an SQL query.
• Examines a parse tree.
• Checks:
  • attributes,
  • relation names,
  • types.
• Resolves attribute references.
CONVERSION TO A LOGICAL
QUERY PLAN
• Converts a semantically checked parse tree to an algebraic
expression.
• The conversion is straightforward, but subqueries need to be
optimized.
• The two-argument selection approach can be used.
ALGEBRAIC
TRANSFORMATION
• There are many different ways to transform a logical query
plan using algebraic transformations.
• The laws used for these transformations:
  • commutative and associative laws,
  • laws involving selection,
  • pushing selections,
  • laws involving projection,
  • laws about joins and products,
  • laws involving duplicate elimination,
  • laws involving grouping and aggregation.
ESTIMATING SIZES OF
RELATIONS
• True running time is taken into consideration when selecting
the best logical plan.
• Two factors affect size estimates the most:
  • the size of the relation (number of tuples),
  • the number of distinct values for each attribute of each
relation.
• Histograms are used by some systems.
COST-BASED OPTIMIZING
• The best physical query plan is the least costly plan.
• Factors that decide the cost of a query plan:
  • the order and grouping of operations like joins, unions,
and intersections;
  • the nested-loop or hash joins used;
  • scanning and sorting operations;
  • storing intermediate results.
PLAN ENUMERATION
STRATEGIES
• Common approaches for searching the space for the best
physical plan:
  • Dynamic programming: tabulating the best plan for each
subexpression.
  • Selinger-style programming: also tabulating plans that
produce interesting sort orders.
  • Greedy approaches: making a series of locally optimal
decisions.
  • Branch-and-bound: starting with a heuristically found plan
and pruning plans that cannot improve on it.
LEFT-DEEP JOIN TREES
• Left-deep join trees are binary trees with a single spine down
the left edge and with leaves as right children.
• This strategy reduces the number of plans to be considered
for the best physical plan.
• Restrict the search to left-deep join trees when picking a
grouping and order for the join of several relations.
PHYSICAL PLANS FOR
SELECTION
• A selection can be broken into an index-scan of a relation,
followed by a filter operation.
• The filter then examines the tuples retrieved by the index-scan.
• It allows only those tuples to pass which meet the remaining
portions of the selection condition.
PIPELINING VERSUS
MATERIALIZING
• The flow of data between operators can be controlled to
implement “pipelining”: an operator consumes the result of
another operator directly, passed through main memory.
• Intermediate results can instead be removed from main
memory to save space for other operators; this technique is
called “materialization”.
• Both pipelining and materialization should be considered by
the physical-query-plan generator.