CS186: Introduction to Database Systems

Download Report

Transcript CS186: Introduction to Database Systems

EECS 647: Introduction to
Database Systems
Instructor: Luke Huan
Spring 2007
Administrative


Homework 4 is due today
Homework 5 is assigned today, it is due April 23.
7/17/2015
Luke Huan Univ. of Kansas
2
A DBMS Review
7/17/2015
Luke Huan Univ. of Kansas
3
Review DBMS Architecture
User/Web Forms/Applications/DBA
query
transaction
Query Parser
Transaction Manager
Query Rewriter
Query Optimizer
Lock Manager
Logging &
Recovery
Query Executor
Files & Access Methods
Buffer Manager
Buffers
Lock Tables
Main Memory
Storage Manager
Storage
7/17/2015
Luke Huan Univ. of Kansas
4
Review Operators in QP

Logical operators:


Physical operators:


what they do
how they do it
Exercise: L: logical operator, P: physical operator




Union
Join
Nested loop join
Sort-merge join
7/17/2015
L
L
P
P
Luke Huan Univ. of Kansas
5
Review Query Tree and Physical Operators
 buyer,item
City=‘urbana’
SELECT P.buyer, P.item
FROM Purchase P, Person Q
WHERE P.buyer=Q.name
AND Q.city=‘urbana’
buyer=name
Purchase
(Table scan)
7/17/2015
Luke Huan Univ. of Kansas
(Simple
nested loops)
Person
(Index scan)
6
Preview A query’s trip through the DBMS
SQL query
<Query>
<SFW>
<select-list> <where-cond>
<from-list>
…
…
<table> <table>
Enroll Course
PROJECT (title, SID)
Sort MERGE-JOIN (CID)
Parser
Parse tree
Rewritor
Logical plan
Optimizer
¼title, SID
¾Enroll.CID = Course.CID
£
Enroll Course
Physical plan
SORT (CID) SORT (CID)
Executor
SCAN (Enroll)
SCAN (Course)
Result
7/17/2015
SELECT title, SID
FROM Enroll, Course
WHERE Enroll.CID =
Course.CID;
Luke Huan Univ. of Kansas
7
Today’s Topic

A system view of DBMS query processing process





Parser
Rewriter
Optimizer
Executor
Query optimization


Heuristic based methods
Cost based methods
7/17/2015
Luke Huan Univ. of Kansas
8
Parsing and validation

Parser: SQL ! parse tree



Good old lex & yacc
Detect and reject syntax errors
Validator: validate parse tree

Detect and reject semantic errors



Nonexistent tables/views/columns?
Insufficient access privileges?
Type mismatches?


Also



Examples: AVG(name), name + GPA, Student UNION Enroll
Expand *
Expand view definitions
Information required for semantic checking is found in system
catalog (contains all schema information)
7/17/2015
Luke Huan Univ. of Kansas
9
Logical plan


Nodes are logical operators (often relational algebra operators)
There are many equivalent logical plans
¼title
¾Student.name=“Bart” Æ Student.SID = Enroll.SID Æ Enroll.CID = Course.CID
£
£
Course
Student Enroll
¼title
An equivalent plan: !
Enroll.CID = Course.CID
Course
!Student.SID = Enroll.SID
Enroll
¾name = “Bart”
Student
7/17/2015
Luke Huan Univ. of Kansas
10
Examples of physical plans

Many physical plans for a single query


Equivalent results, but different costs and assumptions!
DBMS query optimizer picks the “best” possible physical plan
¼title
PROJECT
PROJECT(title)
(title)
!Enroll.CID = Course.CID INDEX-NESTED-LOOP-JOIN
MERGE-JOIN (CID) (CID)
Course
Index
Course(CID)
SCANon(Course)
SORT
(CID)
!Student.SID = Enroll.SID INDEX-NESTED-LOOP-JOIN (SID)
MERGE-JOIN (SID)
Enroll
Index on Enroll(SID)
SORT (SID)
¾name = “Bart”
FILTER
(name
=
“Bart”)
INDEX-SCAN (name = “Bart”)
SCAN (Enroll)
Student
Index
on(Student)
Student(name)
SCAN
7/17/2015
Luke Huan Univ. of Kansas
11
Physical plan execution

How are intermediate results passed from child
operators to parent operators?

Temporary files




Compute the tree bottom-up
Children write intermediate results to temporary files
Parents read temporary files
Iterators


7/17/2015
Do not materialize intermediate results
Children pipeline their results to parents
Luke Huan Univ. of Kansas
12
Iterator interface

Every physical operator maintains its own execution
state and implements the following methods:



open(): Initialize state and get ready for processing
getNext(): Return the next tuple in the result (or a null
pointer if there are no more tuples); adjust state to allow
subsequent tuples to be obtained
close(): Clean up
7/17/2015
Luke Huan Univ. of Kansas
13
An iterator for table scan



State: a block of memory for buffering input R;
a pointer to a tuple within the block
open(): allocate a block of memory
getNext()

If no block of R has been read yet, read the first block from the disk
and return the first tuple in the block


If there is no more tuple left in the current block, read the next block
of R from the disk and return the first tuple in the block



Or the null pointer if R is empty
Or the null pointer if there are no more blocks in R
Otherwise, return the next tuple in the memory block
close(): deallocate the block of memory
7/17/2015
Luke Huan Univ. of Kansas
14
An iterator for SORT-MERGE-JOIN
R: An iterator for the left subtree
S: An iterator for the right subtree
R
sort R; sort S;
R.open(); S.open();
r = R.next(); s = S.next();
while (r != NULL && s != NULL) {
while (r.c < s.c)
r = R.next();
if (r.c = s.c) {
while (r.c = s.c) {
output rs;
r = R.next();
}
"rewind" r to first tuple of R where r.c = s.c;
s = S.next();
}
while (r.c > s.c)
s = S.next();
}
7/17/2015
SORT-MERGE-JOIN
Luke Huan Univ. of Kansas
S
Complexity: some of the
DBMSs (e.g. pgSQL) predate
C++!
15
Query optimization


One logical plan ! “best” physical plan
Questions




How to enumerate possible plans
How to estimate costs
How to pick the “best” one
Often the goal is not getting the optimum plan, but
instead avoiding the horrible ones
Any of these will do
1 second
7/17/2015
1 minute
1 hour
Luke Huan Univ. of Kansas
16
Plan enumeration in relational algebra


Apply relational algebra equivalences
Join reordering: £ and ! are associative and
commutative (except column ordering, but that is
unimportant)
!
=
!
R
7/17/2015
T
S
!
!
S
=
!
T
R
Luke Huan Univ. of Kansas
!
S
R
=
…
T
17
More relational algebra equivalences




Convert ¾p-£ to/from !p: ¾p(R £ S) = R !p S
Merge/split ¾’s: ¾p1(¾p2 R) = ¾p1 Æ p2 R
Merge/split ¼’s: ¼L1(¼L2 R) = ¼L1 R, where L1 µ L2
Push down/pull up ¾:
¾p Æ pr Æ ps (R !p’ S) = (¾pr R) !p Æ p’ (¾ps S), where




Push down ¼: ¼L (¾p R) = ¼L (¾p (¼L L’ R)), where


pr is a predicate involving only R columns
ps is a predicate involving only S columns
p and p’ are predicates involving both R and S columns
L’ is the set of columns referenced by p that are not in L
Many more (seemingly trivial) equivalences…

Can be systematically used to transform a plan to new ones
7/17/2015
Luke Huan Univ. of Kansas
18
Relational query rewrite example
¼title
¾Student.name=“Bart” Æ Student.SID = Enroll.SID Æ Enroll.CID = Course.CI
£
£
Course
Student Enroll
¼title
¾Enroll.CID = Course.CID
£
Push down ¾
Convert ¾p-£ to !p
¼title
!Enroll.CID = Course.CID
Course
¾Student.SID = Enroll.SID
£
Course
Enroll
!Student.SID = Enroll.SID
¾Student.name = “Bart”
Enroll
Student
¾name = “Bart”
7/17/2015
Student
Luke Huan Univ. of Kansas
19
Heuristics-based query optimization


Start with a logical plan
Push selections/projections down as much as possible



Join smaller relations first, and avoid cross product



Why? Reduce the size of intermediate results
Why not? May be expensive; maybe joins filter better
Why? Reduce the size of intermediate results
Why not? Size depends on join selectivity too
Convert the transformed logical plan to a physical plan
(by choosing appropriate physical operators)
7/17/2015
Luke Huan Univ. of Kansas
20
SQL query rewrite
More complicated—subqueries and views divide a
query into nested “blocks”





Processing each block separately forces particular join
methods and join order
Even if the plan is optimal for each block, it may not be
optimal for the entire query
Unnest query: convert subqueries/views to joins
We can just deal with select-project-join queries

Where the clean rules of relational algebra apply
7/17/2015
Luke Huan Univ. of Kansas
21
SQL query rewrite example


SELECT name
FROM Student
WHERE SID = ANY (SELECT SID FROM Enroll);
SELECT name
FROM Student, Enroll
WHERE Student.SID = Enroll.SID;


Wrong—consider two Bart’s, each taking two classes
SELECT name
FROM (SELECT DISTINCT Student.SID, name
FROM Student, Enroll
WHERE Student.SID = Enroll.SID);

Right—assuming Student.SID is a key
7/17/2015
Luke Huan Univ. of Kansas
22
Heuristics- vs. cost-based
optimization

Heuristics-based optimization


Apply heuristics to rewrite plans into cheaper ones
Cost-based optimization


Rewrite logical plan to combine “blocks” as much as
possible
Optimize query block by block




Enumerate logical plans (already covered)
Estimate the cost of plans
Pick a plan with acceptable cost
Focus: select-project-join blocks
7/17/2015
Luke Huan Univ. of Kansas
23
Recap of Query Processing in DBMS



Parser: SQL ! parse tree
Rewriter: parse tree ! Logical plan
Next:
 Optimizer: logical plan ! Physcial plan
7/17/2015
Luke Huan Univ. of Kansas
24
Cost estimation
PROJECT (title)
Physical plan example:
MERGE-JOIN (CID)
SORT (CID) SCAN (Course)
MERGE-JOIN (SID)
Input to SORT(CID):
FILTER (name = “Bart”)
SCAN (Student)

SCAN (Enroll)
We have: cost estimation for each operator

Example: SORT(CID) takes log2B(input) £ B(input)


SORT (SID)
But what is B(input)?
We need: size of intermediate results
7/17/2015
Luke Huan Univ. of Kansas
25
Selections with equality predicates


Q: ¾A = v R
Suppose the following information is available



Assumptions



Size of R: |R|
Number of distinct A values in R: |¼A R|
Values of A are uniformly distributed in R
Values of v in Q are uniformly distributed over all R.A
values
|Q| ¼ |R| ⁄ |¼A R|

Selectivity factor of (A = v) is 1 ⁄ |¼A R|
7/17/2015
Luke Huan Univ. of Kansas
26
Conjunctive predicates


Q: ¾A = u and B = v R
Additional assumptions

(A = u) and (B = v) are independent


No “over”-selection


Counterexample: major and advisor
Counterexample: A is the key
|Q| ¼ |R| ⁄ (|¼A R| · |¼B R|)

Reduce total size by all selectivity factors
7/17/2015
Luke Huan Univ. of Kansas
27
Negated and disjunctive predicates

Q: ¾A  v R

|Q| ¼ |R| · (1 – 1 ⁄ |¼A R|)


Selectivity factor of : p is (1 – selectivity factor of p)
Q: ¾A = u or B = v R

|Q| ¼ |R| · (1 ⁄ |¼A R| + 1 ⁄ |¼B R|)?


No! Tuples satisfying (A = u) and (B = v) are counted twice
|Q| ¼ |R| · (1 – (1 – 1 ⁄ |¼A R|) · (1 – 1 ⁄ |¼B R|))

7/17/2015
Intuition: (A = u) or (B = v) is equivalent to
: ( : (A = u) AND : (B = v))
Luke Huan Univ. of Kansas
28
Cost estimation: summary


Using similar ideas, we can estimate the size of projection,
duplicate elimination, union, difference, aggregation (with
grouping)
Lots of assumptions and very rough estimation



Accurate estimate is not needed
Maybe okay if we overestimate or underestimate consistently
May lead to very nasty optimizer “hints”
SELECT * FROM Student WHERE GPA > 3.9;
SELECT * FROM Student WHERE GPA > 3.9 AND GPA >
3.9;

Not covered: better estimation using histograms
7/17/2015
Luke Huan Univ. of Kansas
29
Search for the best plan
!


Huge search space
“Bushy” plan example:
!
R2

R1 R3
Just considering different join orders, there are
R4
(2n – 2)! / (n – 1) bushy plans for R1 ! L ! Rn


!
!
R5
30240 for n = 6
And there are more if we consider:



Multiway joins
Different join methods
Placement of selection and projection operators
7/17/2015
Luke Huan Univ. of Kansas
30
Left-deep plans
!
!
R5
!
R4
!
R
3

R2
R1
Heuristic: consider only “left-deep” plans, in which only the left
child can be a join


Tend to be better than plans of other shapes, because many join
algorithms scan inner (right) relation multiple times—you will not
want it to be a complex subtree
How many left-deep plans are there for R1 ! L ! Rn?

Significantly fewer, but still lots— n! (720 for n = 6)
7/17/2015
Luke Huan Univ. of Kansas
31
A greedy algorithm

S1, …, Sn



Say selections have been pushed down; i.e., Si = ¾p Ri
Start with the pair Si, Sj with the smallest estimated size for Si !
Sj
Repeat until no relation is left:
Pick Sk from the remaining relations such that the join of Sk and
the current result yields an intermediate result of the smallest size
Pick most efficient join method
Minimize expected size
Current subplan
7/17/2015
!
Remaining
…, Sk, Sl, Sm, … relations
to be joined
Sk
Luke Huan Univ. of Kansas
32
A dynamic programming approach

Generate optimal plans bottom-up







Pass 1: Find the best single-table plans (for each table)
Pass 2: Find the best two-table plans (for each pair of tables) by
combining best single-table plans
…
Pass k: Find the best k-table plans (for each combination of k tables)
by combining two smaller best plans found in previous passes
…
Rationale: Any subplan of an optimal plan must also be optimal
(otherwise, just replace the subplan to get a better overall plan)
Well, not quite…
7/17/2015
Luke Huan Univ. of Kansas
33
The need for “interesting order”



Example: R(A, B) ! S(A, C) ! T(A, D)
Best plan for R ! S: hash join (beats sort-merge join)
Best overall plan: sort-merge join R and S, and then sort-merge
join with T


Subplan of the optimal plan is not optimal!
Why?


The result of the sort-merge join of R and S is sorted on A
This is an interesting order that can be exploited by later processing
(e.g., join, duplicate elimination, GROUP BY, ORDER BY, etc.)!
7/17/2015
Luke Huan Univ. of Kansas
34
Summary




Relational algebra equivalence
SQL rewrite tricks
Heuristics-based optimization
Cost-based optimization



Need statistics to estimate sizes of intermediate results
Greedy approach
Dynamic programming approach
7/17/2015
Luke Huan Univ. of Kansas
35