Query Optimization R&G, Chapter 15 Lecture 16 Administrivia • Homework 3 available today – Written exercise; will be posted on class website – Due date:

Download Report

Transcript Query Optimization R&G, Chapter 15 Lecture 16 Administrivia • Homework 3 available today – Written exercise; will be posted on class website – Due date:

Query Optimization
R&G, Chapter 15
Lecture 16
Administrivia
• Homework 3 available today
– Written exercise; will be posted on class website
– Due date: Tuesday, March 20 by end of class period
• Homework 4 available later this week
– Implement nested loops and hash join operators for
minibase
– Due date: April 5 (after Spring Break)
• Midterm 2 is 3/22, 2 weeks from today
– In class, covers lectures 10-17
– Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall
Review
Now you are here
Query Optimization
and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management
DB
•Query plans are a tree of operators that
compute the result of a query
•Optimization is the process of picking the
best plan
•Execution is the process of executing the
plan
You were here
•Operators are the building blocks for
computing results of queries
•Sort
•Project
•Join
•Filter
•Access methods for files
•...
Query Plans: turning text into tuples
Query Result
Query
Shift
Name
SELECT A.aname, max(F.feedingshift)
FROM Animals A, Feeding F
WHERE A.aid = F.aid AND
(A.species = 'Big Cat' or A.species = 'Bear')
GROUP BY A.aname
HAVING COUNT(*) > 1
Aslan
3
Bageera
3
Elsa
3
Shere Khan
3
Tigger
2
Operators
Query Plan
40
100
Aslan
Big Cat
50
300
Baloo
Bear
60
100
Bageera
Big Cat
3
70
100
Shere
Khan
Big Cat
100
3
90
100
Dumbo
Elephant
…
…
…
…
…
…
10
2
1
100
3
10
3
2
100
3
20
3
2
100
3
20
2
3
100
30
3
2
…
…
…
Operator Review
•
•
Access Path : pulls tuples from tables
– File scans
– Index scans (clustered or unclustered)
– Index-only scans
Select (or Filter): conditionally excludes
tuples
– Can be ‘pushed’/combined with Access Path
operator
• Use indexes where possible and apply other
predicates on the result
•
– Can also be applied at intermediate point in
query plan
Projection: removes columns and duplicates
– Column projection often done by operators
– Duplicate elimination via Sort or Hash
Operator Review
• Sort: sorts tuples in a particular order
– Simple merge sort
– General external merge sort (with
various optimizations)
– B+ tree traversal
• Join: combine tuples from 2 other
operators
– Page nested loops
– Block nested loops
– Index nested loops
– Sort-merge join
– Hash-join
• Other operators for
– Group By, Temping, …
Query Optimization steps
SELECT A.aname, max(F.feedingshift)
FROM Animals A, Feeding F
WHERE A.aid = F.aid AND
(A.species = 'Big Cat' or A.species = 'Bear')
GROUP BY A.aname
HAVING COUNT(*) > 1
1. Parse query from text to
‘intermediate model’
2. Traverse ‘intermediate model’ and
produce alternate query plans
–
–
Query parser
Block 3
Block 2
Query optimizer
Query plan = tree of relational
operators
Optimizer keeps track of cost and
properties of plans
3. Pick the cheapest plan
4. Pass cheapest plan on to query
execution engine to execute and
produce results of query
Block 1
Cost = 200
Cost = 150
Cost = 500
To execution engine
Query Blocks: Units of Optimization
• Intermediate model is a set of
query blocks
– 1 per
SELECT/FROM/WHERE/GROUP
BY/HAVING clause
SELECT A.aname, max(F.feedingshift)
FROM Animals A, Feeding F
WHERE A.aid = F.aid AND (A.species = 'Big Cat'
or A.species = 'Bear')
GROUP BY A.aname
HAVING COUNT(*) > 1
Query Block
• Subqueries produce
nested query blocks
– treated as calls to a
subroutine, made once
per tuple produced by
outer query block
– sometimes subqueries can
be rewritten to produce
cheaper plan
SELECT S.sname
FROM Sailors S
WHERE S.age IN
(SELECT MAX (S2.age)
FROM Sailors S2
GROUP BY S2.rating)
Outer Query Block
X
Nested Query Block
Rewritten Query Block
Query blocks are optimized 1 at a time
1. Convert block to relational algebra
tree
2. Traverse tree and build plan bottom
up:
• Pick best access method for each
relation in FROM clause
•
•
Applying predicates if possible
Consider all join trees
•
•
All ways to join relations in FROM
clause 1-at-a time
Consider multiple permutations and join
methods
– But not all! too many choices
– Restrict to left-deep plans
– Prune bad plans along the way
Query Block
Converting Query Blocks to Relational
Algebra Trees
• SQL is relationally complete; can express
everything in relational algebra
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
(sname)(bid=100  rating > 5) (Reserves  Sailors)
sname
bid=100
rating > 5
sid=sid
Reserves
Sailors
SQL extends Relational Algebra
• SQL is more powerful than relational algebra
– extend relational algebra to include aggregate
ops: GROUP BY, HAVING
• How is this query block expressed?
SELECT S.sname
FROM Sailors S
WHERE S.age IN (constant set from subquery)
Πsname(σ
σ(age in set from subquery) Sailors
Sailors)
• And this query block?
SELECT MAX (S2.age)
FROM Sailors S2
GROUP BY S2.rating
ΠMax(age)(GroupBy
GroupByRating
(Sailors) )
Rating(Sailors)
Why optimize?
sname
bid=100
rating > 5
sid=sid
Reserves
Sailors
• Operators have implementation choices
– Index scan? File scan? Nested loop join? Sort merge?
• Operators can also be applied in different order!
Motivating Example -Schema used
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• As seen in previous lectures…
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000
pages.
– Assume there are 100 boats
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
– Assume there are 10 different ratings
• Assume we have 5 pages in our buffer pool!
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
• Cost: 500+500*1000 I/Os
• Not the worst plan, but…
• Misses several opportunities:
– selections could have been
`pushed’ earlier,
– indexes might have been
helpful….
• Goal of optimization: To find
more efficient plans that
compute the same answer.
(On-the-fly)
sname
bid=100
Plan:
rating > 5
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
Sailors
Reserves
500
1000
Selectivity calculation
• Sailors: 500 pages, 80 tuples per page, 10
ratings
• Selectivity of S.rating > 5?
– ½ -> 500*80/2 = 20,000 tuples
– 20,000/80 = 250 pages
• Reserves: 1000 pages, 100 tuples per page,
100 boats
• Selectivity of R.bid = 100?
– 1/100 -> 1000*100/100 = 1000 tuples
– 1000/100 = 10 pages
Alternative Plans – Push Selects
(No Indexes)
(On-the-fly)
sname
(On-the-fly)
sname
bid=100
bid=100
rating > 5
(On-the-fly)
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
(Page-Oriented
sid=sid Nested loops)
500
Sailors
250
rating > 5
1000
Reserves
500,500 IOs
1000
(On-the-fly) Reserves
500 Sailors
500 + 250 *1000 =
250,500 IOs
Alternative Plans – Push Selects
(No Indexes)
sname
(On-the-fly)
sname
bid=100
(On-the-fly)
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
(Page-Oriented
sid=sid Nested loops)
250
rating > 5
250
10
bid = 100
rating > 5
1000
(On-the-fly)
(On-the-fly)
(On-the-fly) Reserves
500 Sailors
Reserves
Sailors
250,500 IOs
500 + 250 *1000 =
250,500 IOs
1000
Alternative Plans – Try different join
order
sname
bid=100
(On-the-fly)
rating > 5
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
(On-the-fly)
sname
10
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
500
swap
rating > 5
(On-the-fly) Reserves
Sailors
250,500 IOs
bid=100
(On-the-fly)
Sailors
1000 Reserves
1000 + 10 *500=
6000 IOs
Alternative Plans – Push Selects and
precompute result (No Indexes)
(On-the-fly)
sname
sname
rating > 5
bid=100
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
10
500
(On-the-fly)
(On-the-fly)
Sailors
(Page-Oriented
sid=sid Nested loops)
10
250
bid=100
rating > 5
(On-the-fly)
1000 Reserves
(Scan &
Write to 250
temp T2)
Sailors
500
Reserves
1000
6000 IOs
1000 + 500+ 250 + (10 * 250) = 4250 IOs
Alternative Plans – Try different join
order
(On-the-fly)
sname
10
bid=100
(Page-Oriented
sid=sid Nested loops)
250
swap
rating > 5
(On-the-fly)
(Scan &
Write to
temp T2)
(On-the-fly)
sname
250
(Page-Oriented
sid=sid Nested loops)
rating>5
(On-the-fly)
10
bid=100
(Scan &
Write to 10
temp T2)
250
Reserves
1000
4250 IOs
Sailors
Sailors
500
500
Reserves
1000
500 + 1000 +10 +(250 *10) = 4010 IOs
Optimized query is 124x cheaper
than the original!
(On-the-fly)
sname
bid=100
rating > 5
Reserves
500,500 IOs
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
(On-the-fly)
(Page-Oriented
sid=sid Nested loops)
Sailors
sname
bid=100
rating>5
(On-the-fly)
Sailors
(Scan &
Write to
temp T2)
Reserves
4010 IOs
More Alternative Plans
(No Indexes)
• Main difference:
Sort Merge Join
(On-the-fly)
sname
(Sort-Merge Join)
sid=sid
(Scan;
write to bid=100
temp T1)
Reserves
rating > 5
(Scan;
write to
temp T2)
Sailors
• With 5 buffers, cost of plan:
– Scan Reserves (1000) + write temp T1 (10 pages, if we
have 100 boats, uniform distribution) = 1010.
– Scan Sailors (500) + write temp T2 (250 pages, if have 10 ratings) =
750.
– Sort T1 (2*2*10) + sort T2 (2*4*250) + merge (10+250) = 2300
Total: 4060 page I/Os.
• If use BNL join, join = 10+4*250, total cost = 2770.
• Can also `push’ projections, but must be careful!
– T1 has only sid, T2 only sid, sname:
– T1 fits in 3 pgs, cost of BNL under 250 pgs, total < 2000.
–
(On-the-fly)
More Alt Plans: Indexes
• With clustered index on bid
of Reserves, we get
100,000/100 = 1000 tuples
on 1000/100 = 10 pages.
• INL with outer not
materialized.
sname
rating > 5
(On-the-fly)
(Index Nested Loops,
sid=sid with pipelining )
(Use hash
Index, do
bid=100
not write
to temp)
– Projecting out unnecessary fields from outer
doesn’t help.
Sailors
Reserves

Join column sid is a key for Sailors.
–At most one matching tuple, unclustered index on sid OK.

Decision not to push rating>5 before the join is based on
availability of sid index on Sailors.
Cost: Selection of Reserves tuples (10 I/Os); then, for each,
must get matching Sailors tuple (1000*1.2); total 1210 I/Os.

What is needed for optimization?
• A closed set of operators
– Relational ops (table in, table out)
– Encapsulation based on iterators
• Plan space, based on
– Based on relational equivalences, different
implementations
• Cost Estimation, based on
– Cost formulas
– Size estimation, based on
• Catalog information on base tables
• Selectivity (Reduction Factor) estimation
• A search algorithm
– To sift through the plan space based on cost!