Query Optimization R&G, Chapter 15 Lecture 16 Administrivia • Homework 3 available today – Written exercise; will be posted on class website – Due date:
Download ReportTranscript Query Optimization R&G, Chapter 15 Lecture 16 Administrivia • Homework 3 available today – Written exercise; will be posted on class website – Due date:
Query Optimization R&G, Chapter 15 Lecture 16 Administrivia • Homework 3 available today – Written exercise; will be posted on class website – Due date: Tuesday, March 20 by end of class period • Homework 4 available later this week – Implement nested loops and hash join operators for minibase – Due date: April 5 (after Spring Break) • Midterm 2 is 3/22, 2 weeks from today – In class, covers lectures 10-17 – Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall Review Now you are here Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB •Query plans are a tree of operators that compute the result of a query •Optimization is the process of picking the best plan •Execution is the process of executing the plan You were here •Operators are the building blocks for computing results of queries •Sort •Project •Join •Filter •Access methods for files •... Query Plans: turning text into tuples Query Result Query Shift Name SELECT A.aname, max(F.feedingshift) FROM Animals A, Feeding F WHERE A.aid = F.aid AND (A.species = 'Big Cat' or A.species = 'Bear') GROUP BY A.aname HAVING COUNT(*) > 1 Aslan 3 Bageera 3 Elsa 3 Shere Khan 3 Tigger 2 Operators Query Plan 40 100 Aslan Big Cat 50 300 Baloo Bear 60 100 Bageera Big Cat 3 70 100 Shere Khan Big Cat 100 3 90 100 Dumbo Elephant … … … … … … 10 2 1 100 3 10 3 2 100 3 20 3 2 100 3 20 2 3 100 30 3 2 … … … Operator Review • • Access Path : pulls tuples from tables – File scans – Index scans (clustered or unclustered) – Index-only scans Select (or Filter): conditionally excludes tuples – Can be ‘pushed’/combined with Access Path operator • Use indexes where possible and apply other predicates on the result • – Can also be applied at intermediate point in query plan Projection: removes columns and duplicates – Column projection often done by operators – Duplicate elimination via Sort or Hash Operator Review • Sort: sorts tuples in a particular order – Simple merge sort – General external merge sort (with various optimizations) – B+ tree traversal • Join: combine tuples from 2 other operators – Page nested loops – Block nested loops – Index nested loops – Sort-merge join – Hash-join • Other operators for – Group By, Temping, … Query Optimization steps SELECT A.aname, max(F.feedingshift) FROM Animals A, Feeding F WHERE A.aid = F.aid AND (A.species = 'Big Cat' or A.species = 'Bear') GROUP BY A.aname HAVING COUNT(*) > 1 1. Parse query from text to ‘intermediate model’ 2. Traverse ‘intermediate model’ and produce alternate query plans – – Query parser Block 3 Block 2 Query optimizer Query plan = tree of relational operators Optimizer keeps track of cost and properties of plans 3. Pick the cheapest plan 4. Pass cheapest plan on to query execution engine to execute and produce results of query Block 1 Cost = 200 Cost = 150 Cost = 500 To execution engine Query Blocks: Units of Optimization • Intermediate model is a set of query blocks – 1 per SELECT/FROM/WHERE/GROUP BY/HAVING clause SELECT A.aname, max(F.feedingshift) FROM Animals A, Feeding F WHERE A.aid = F.aid AND (A.species = 'Big Cat' or A.species = 'Bear') GROUP BY A.aname HAVING COUNT(*) > 1 Query Block • Subqueries produce nested query blocks – treated as calls to a subroutine, made once per tuple produced by outer query block – sometimes subqueries can be rewritten to produce cheaper plan SELECT S.sname FROM Sailors S WHERE S.age IN (SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating) Outer Query Block X Nested Query Block Rewritten Query Block Query blocks are optimized 1 at a time 1. Convert block to relational algebra tree 2. Traverse tree and build plan bottom up: • Pick best access method for each relation in FROM clause • • Applying predicates if possible Consider all join trees • • All ways to join relations in FROM clause 1-at-a time Consider multiple permutations and join methods – But not all! too many choices – Restrict to left-deep plans – Prune bad plans along the way Query Block Converting Query Blocks to Relational Algebra Trees • SQL is relationally complete; can express everything in relational algebra SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 (sname)(bid=100 rating > 5) (Reserves Sailors) sname bid=100 rating > 5 sid=sid Reserves Sailors SQL extends Relational Algebra • SQL is more powerful than relational algebra – extend relational algebra to include aggregate ops: GROUP BY, HAVING • How is this query block expressed? SELECT S.sname FROM Sailors S WHERE S.age IN (constant set from subquery) Πsname(σ σ(age in set from subquery) Sailors Sailors) • And this query block? SELECT MAX (S2.age) FROM Sailors S2 GROUP BY S2.rating ΠMax(age)(GroupBy GroupByRating (Sailors) ) Rating(Sailors) Why optimize? sname bid=100 rating > 5 sid=sid Reserves Sailors • Operators have implementation choices – Index scan? File scan? Nested loop join? Sort merge? • Operators can also be applied in different order! Motivating Example -Schema used Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string) • As seen in previous lectures… • Reserves: – Each tuple is 40 bytes long, 100 tuples per page, 1000 pages. – Assume there are 100 boats • Sailors: – Each tuple is 50 bytes long, 80 tuples per page, 500 pages. – Assume there are 10 different ratings • Assume we have 5 pages in our buffer pool! Motivating Example SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5 • Cost: 500+500*1000 I/Os • Not the worst plan, but… • Misses several opportunities: – selections could have been `pushed’ earlier, – indexes might have been helpful…. • Goal of optimization: To find more efficient plans that compute the same answer. (On-the-fly) sname bid=100 Plan: rating > 5 (On-the-fly) (Page-Oriented sid=sid Nested loops) Sailors Reserves 500 1000 Selectivity calculation • Sailors: 500 pages, 80 tuples per page, 10 ratings • Selectivity of S.rating > 5? – ½ -> 500*80/2 = 20,000 tuples – 20,000/80 = 250 pages • Reserves: 1000 pages, 100 tuples per page, 100 boats • Selectivity of R.bid = 100? – 1/100 -> 1000*100/100 = 1000 tuples – 1000/100 = 10 pages Alternative Plans – Push Selects (No Indexes) (On-the-fly) sname (On-the-fly) sname bid=100 bid=100 rating > 5 (On-the-fly) (On-the-fly) (Page-Oriented sid=sid Nested loops) (Page-Oriented sid=sid Nested loops) 500 Sailors 250 rating > 5 1000 Reserves 500,500 IOs 1000 (On-the-fly) Reserves 500 Sailors 500 + 250 *1000 = 250,500 IOs Alternative Plans – Push Selects (No Indexes) sname (On-the-fly) sname bid=100 (On-the-fly) (On-the-fly) (Page-Oriented sid=sid Nested loops) (Page-Oriented sid=sid Nested loops) 250 rating > 5 250 10 bid = 100 rating > 5 1000 (On-the-fly) (On-the-fly) (On-the-fly) Reserves 500 Sailors Reserves Sailors 250,500 IOs 500 + 250 *1000 = 250,500 IOs 1000 Alternative Plans – Try different join order sname bid=100 (On-the-fly) rating > 5 (On-the-fly) (Page-Oriented sid=sid Nested loops) (On-the-fly) sname 10 (On-the-fly) (Page-Oriented sid=sid Nested loops) 500 swap rating > 5 (On-the-fly) Reserves Sailors 250,500 IOs bid=100 (On-the-fly) Sailors 1000 Reserves 1000 + 10 *500= 6000 IOs Alternative Plans – Push Selects and precompute result (No Indexes) (On-the-fly) sname sname rating > 5 bid=100 (On-the-fly) (Page-Oriented sid=sid Nested loops) 10 500 (On-the-fly) (On-the-fly) Sailors (Page-Oriented sid=sid Nested loops) 10 250 bid=100 rating > 5 (On-the-fly) 1000 Reserves (Scan & Write to 250 temp T2) Sailors 500 Reserves 1000 6000 IOs 1000 + 500+ 250 + (10 * 250) = 4250 IOs Alternative Plans – Try different join order (On-the-fly) sname 10 bid=100 (Page-Oriented sid=sid Nested loops) 250 swap rating > 5 (On-the-fly) (Scan & Write to temp T2) (On-the-fly) sname 250 (Page-Oriented sid=sid Nested loops) rating>5 (On-the-fly) 10 bid=100 (Scan & Write to 10 temp T2) 250 Reserves 1000 4250 IOs Sailors Sailors 500 500 Reserves 1000 500 + 1000 +10 +(250 *10) = 4010 IOs Optimized query is 124x cheaper than the original! (On-the-fly) sname bid=100 rating > 5 Reserves 500,500 IOs (On-the-fly) (Page-Oriented sid=sid Nested loops) (On-the-fly) (Page-Oriented sid=sid Nested loops) Sailors sname bid=100 rating>5 (On-the-fly) Sailors (Scan & Write to temp T2) Reserves 4010 IOs More Alternative Plans (No Indexes) • Main difference: Sort Merge Join (On-the-fly) sname (Sort-Merge Join) sid=sid (Scan; write to bid=100 temp T1) Reserves rating > 5 (Scan; write to temp T2) Sailors • With 5 buffers, cost of plan: – Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution) = 1010. – Scan Sailors (500) + write temp T2 (250 pages, if have 10 ratings) = 750. – Sort T1 (2*2*10) + sort T2 (2*4*250) + merge (10+250) = 2300 Total: 4060 page I/Os. • If use BNL join, join = 10+4*250, total cost = 2770. • Can also `push’ projections, but must be careful! – T1 has only sid, T2 only sid, sname: – T1 fits in 3 pgs, cost of BNL under 250 pgs, total < 2000. – (On-the-fly) More Alt Plans: Indexes • With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages. • INL with outer not materialized. sname rating > 5 (On-the-fly) (Index Nested Loops, sid=sid with pipelining ) (Use hash Index, do bid=100 not write to temp) – Projecting out unnecessary fields from outer doesn’t help. Sailors Reserves Join column sid is a key for Sailors. –At most one matching tuple, unclustered index on sid OK. Decision not to push rating>5 before the join is based on availability of sid index on Sailors. Cost: Selection of Reserves tuples (10 I/Os); then, for each, must get matching Sailors tuple (1000*1.2); total 1210 I/Os. What is needed for optimization? • A closed set of operators – Relational ops (table in, table out) – Encapsulation based on iterators • Plan space, based on – Based on relational equivalences, different implementations • Cost Estimation, based on – Cost formulas – Size estimation, based on • Catalog information on base tables • Selectivity (Reduction Factor) estimation • A search algorithm – To sift through the plan space based on cost!