CS186: Introduction to Database Systems
Download
Report
Transcript CS186: Introduction to Database Systems
EECS 647: Introduction to
Database Systems
Instructor: Luke Huan
Spring 2007
Administrative
Homework 4 is due today
Homework 5 is assigned today, it is due April 23.
7/17/2015
Luke Huan Univ. of Kansas
2
A DBMS Review
7/17/2015
Luke Huan Univ. of Kansas
3
Review DBMS Architecture
User/Web Forms/Applications/DBA
query
transaction
Query Parser
Transaction Manager
Query Rewriter
Query Optimizer
Lock Manager
Logging &
Recovery
Query Executor
Files & Access Methods
Buffer Manager
Buffers
Lock Tables
Main Memory
Storage Manager
Storage
7/17/2015
Luke Huan Univ. of Kansas
4
Review Operators in QP
Logical operators:
Physical operators:
what they do
how they do it
Exercise: L: logical operator, P: physical operator
Union
Join
Nested loop join
Sort-merge join
7/17/2015
L
L
P
P
Luke Huan Univ. of Kansas
5
Review Query Tree and Physical Operators
buyer,item
City=‘urbana’
SELECT P.buyer, P.item
FROM Purchase P, Person Q
WHERE P.buyer=Q.name
AND Q.city=‘urbana’
buyer=name
Purchase
(Table scan)
7/17/2015
Luke Huan Univ. of Kansas
(Simple
nested loops)
Person
(Index scan)
6
Preview A query’s trip through the DBMS
SQL query
<Query>
<SFW>
<select-list> <where-cond>
<from-list>
…
…
<table> <table>
Enroll Course
PROJECT (title, SID)
Sort MERGE-JOIN (CID)
Parser
Parse tree
Rewritor
Logical plan
Optimizer
¼title, SID
¾Enroll.CID = Course.CID
£
Enroll Course
Physical plan
SORT (CID) SORT (CID)
Executor
SCAN (Enroll)
SCAN (Course)
Result
7/17/2015
SELECT title, SID
FROM Enroll, Course
WHERE Enroll.CID =
Course.CID;
Luke Huan Univ. of Kansas
7
Today’s Topic
A system view of DBMS query processing process
Parser
Rewriter
Optimizer
Executor
Query optimization
Heuristic based methods
Cost based methods
7/17/2015
Luke Huan Univ. of Kansas
8
Parsing and validation
Parser: SQL ! parse tree
Good old lex & yacc
Detect and reject syntax errors
Validator: validate parse tree
Detect and reject semantic errors
Nonexistent tables/views/columns?
Insufficient access privileges?
Type mismatches?
Also
Examples: AVG(name), name + GPA, Student UNION Enroll
Expand *
Expand view definitions
Information required for semantic checking is found in system
catalog (contains all schema information)
7/17/2015
Luke Huan Univ. of Kansas
9
Logical plan
Nodes are logical operators (often relational algebra operators)
There are many equivalent logical plans
¼title
¾Student.name=“Bart” Æ Student.SID = Enroll.SID Æ Enroll.CID = Course.CID
£
£
Course
Student Enroll
¼title
An equivalent plan: !
Enroll.CID = Course.CID
Course
!Student.SID = Enroll.SID
Enroll
¾name = “Bart”
Student
7/17/2015
Luke Huan Univ. of Kansas
10
Examples of physical plans
Many physical plans for a single query
Equivalent results, but different costs and assumptions!
DBMS query optimizer picks the “best” possible physical plan
¼title
PROJECT
PROJECT(title)
(title)
!Enroll.CID = Course.CID INDEX-NESTED-LOOP-JOIN
MERGE-JOIN (CID) (CID)
Course
Index
Course(CID)
SCANon(Course)
SORT
(CID)
!Student.SID = Enroll.SID INDEX-NESTED-LOOP-JOIN (SID)
MERGE-JOIN (SID)
Enroll
Index on Enroll(SID)
SORT (SID)
¾name = “Bart”
FILTER
(name
=
“Bart”)
INDEX-SCAN (name = “Bart”)
SCAN (Enroll)
Student
Index
on(Student)
Student(name)
SCAN
7/17/2015
Luke Huan Univ. of Kansas
11
Physical plan execution
How are intermediate results passed from child
operators to parent operators?
Temporary files
Compute the tree bottom-up
Children write intermediate results to temporary files
Parents read temporary files
Iterators
7/17/2015
Do not materialize intermediate results
Children pipeline their results to parents
Luke Huan Univ. of Kansas
12
Iterator interface
Every physical operator maintains its own execution
state and implements the following methods:
open(): Initialize state and get ready for processing
getNext(): Return the next tuple in the result (or a null
pointer if there are no more tuples); adjust state to allow
subsequent tuples to be obtained
close(): Clean up
7/17/2015
Luke Huan Univ. of Kansas
13
An iterator for table scan
State: a block of memory for buffering input R;
a pointer to a tuple within the block
open(): allocate a block of memory
getNext()
If no block of R has been read yet, read the first block from the disk
and return the first tuple in the block
If there is no more tuple left in the current block, read the next block
of R from the disk and return the first tuple in the block
Or the null pointer if R is empty
Or the null pointer if there are no more blocks in R
Otherwise, return the next tuple in the memory block
close(): deallocate the block of memory
7/17/2015
Luke Huan Univ. of Kansas
14
An iterator for SORT-MERGE-JOIN
R: An iterator for the left subtree
S: An iterator for the right subtree
R
sort R; sort S;
R.open(); S.open();
r = R.next(); s = S.next();
while (r != NULL && s != NULL) {
while (r.c < s.c)
r = R.next();
if (r.c = s.c) {
while (r.c = s.c) {
output rs;
r = R.next();
}
"rewind" r to first tuple of R where r.c = s.c;
s = S.next();
}
while (r.c > s.c)
s = S.next();
}
7/17/2015
SORT-MERGE-JOIN
Luke Huan Univ. of Kansas
S
Complexity: some of the
DBMSs (e.g. pgSQL) predate
C++!
15
Query optimization
One logical plan ! “best” physical plan
Questions
How to enumerate possible plans
How to estimate costs
How to pick the “best” one
Often the goal is not getting the optimum plan, but
instead avoiding the horrible ones
Any of these will do
1 second
7/17/2015
1 minute
1 hour
Luke Huan Univ. of Kansas
16
Plan enumeration in relational algebra
Apply relational algebra equivalences
Join reordering: £ and ! are associative and
commutative (except column ordering, but that is
unimportant)
!
=
!
R
7/17/2015
T
S
!
!
S
=
!
T
R
Luke Huan Univ. of Kansas
!
S
R
=
…
T
17
More relational algebra equivalences
Convert ¾p-£ to/from !p: ¾p(R £ S) = R !p S
Merge/split ¾’s: ¾p1(¾p2 R) = ¾p1 Æ p2 R
Merge/split ¼’s: ¼L1(¼L2 R) = ¼L1 R, where L1 µ L2
Push down/pull up ¾:
¾p Æ pr Æ ps (R !p’ S) = (¾pr R) !p Æ p’ (¾ps S), where
Push down ¼: ¼L (¾p R) = ¼L (¾p (¼L L’ R)), where
pr is a predicate involving only R columns
ps is a predicate involving only S columns
p and p’ are predicates involving both R and S columns
L’ is the set of columns referenced by p that are not in L
Many more (seemingly trivial) equivalences…
Can be systematically used to transform a plan to new ones
7/17/2015
Luke Huan Univ. of Kansas
18
Relational query rewrite example
¼title
¾Student.name=“Bart” Æ Student.SID = Enroll.SID Æ Enroll.CID = Course.CI
£
£
Course
Student Enroll
¼title
¾Enroll.CID = Course.CID
£
Push down ¾
Convert ¾p-£ to !p
¼title
!Enroll.CID = Course.CID
Course
¾Student.SID = Enroll.SID
£
Course
Enroll
!Student.SID = Enroll.SID
¾Student.name = “Bart”
Enroll
Student
¾name = “Bart”
7/17/2015
Student
Luke Huan Univ. of Kansas
19
Heuristics-based query optimization
Start with a logical plan
Push selections/projections down as much as possible
Join smaller relations first, and avoid cross product
Why? Reduce the size of intermediate results
Why not? May be expensive; maybe joins filter better
Why? Reduce the size of intermediate results
Why not? Size depends on join selectivity too
Convert the transformed logical plan to a physical plan
(by choosing appropriate physical operators)
7/17/2015
Luke Huan Univ. of Kansas
20
SQL query rewrite
More complicated—subqueries and views divide a
query into nested “blocks”
Processing each block separately forces particular join
methods and join order
Even if the plan is optimal for each block, it may not be
optimal for the entire query
Unnest query: convert subqueries/views to joins
We can just deal with select-project-join queries
Where the clean rules of relational algebra apply
7/17/2015
Luke Huan Univ. of Kansas
21
SQL query rewrite example
SELECT name
FROM Student
WHERE SID = ANY (SELECT SID FROM Enroll);
SELECT name
FROM Student, Enroll
WHERE Student.SID = Enroll.SID;
Wrong—consider two Bart’s, each taking two classes
SELECT name
FROM (SELECT DISTINCT Student.SID, name
FROM Student, Enroll
WHERE Student.SID = Enroll.SID);
Right—assuming Student.SID is a key
7/17/2015
Luke Huan Univ. of Kansas
22
Heuristics- vs. cost-based
optimization
Heuristics-based optimization
Apply heuristics to rewrite plans into cheaper ones
Cost-based optimization
Rewrite logical plan to combine “blocks” as much as
possible
Optimize query block by block
Enumerate logical plans (already covered)
Estimate the cost of plans
Pick a plan with acceptable cost
Focus: select-project-join blocks
7/17/2015
Luke Huan Univ. of Kansas
23
Recap of Query Processing in DBMS
Parser: SQL ! parse tree
Rewriter: parse tree ! Logical plan
Next:
Optimizer: logical plan ! Physcial plan
7/17/2015
Luke Huan Univ. of Kansas
24
Cost estimation
PROJECT (title)
Physical plan example:
MERGE-JOIN (CID)
SORT (CID) SCAN (Course)
MERGE-JOIN (SID)
Input to SORT(CID):
FILTER (name = “Bart”)
SCAN (Student)
SCAN (Enroll)
We have: cost estimation for each operator
Example: SORT(CID) takes log2B(input) £ B(input)
SORT (SID)
But what is B(input)?
We need: size of intermediate results
7/17/2015
Luke Huan Univ. of Kansas
25
Selections with equality predicates
Q: ¾A = v R
Suppose the following information is available
Assumptions
Size of R: |R|
Number of distinct A values in R: |¼A R|
Values of A are uniformly distributed in R
Values of v in Q are uniformly distributed over all R.A
values
|Q| ¼ |R| ⁄ |¼A R|
Selectivity factor of (A = v) is 1 ⁄ |¼A R|
7/17/2015
Luke Huan Univ. of Kansas
26
Conjunctive predicates
Q: ¾A = u and B = v R
Additional assumptions
(A = u) and (B = v) are independent
No “over”-selection
Counterexample: major and advisor
Counterexample: A is the key
|Q| ¼ |R| ⁄ (|¼A R| · |¼B R|)
Reduce total size by all selectivity factors
7/17/2015
Luke Huan Univ. of Kansas
27
Negated and disjunctive predicates
Q: ¾A v R
|Q| ¼ |R| · (1 – 1 ⁄ |¼A R|)
Selectivity factor of : p is (1 – selectivity factor of p)
Q: ¾A = u or B = v R
|Q| ¼ |R| · (1 ⁄ |¼A R| + 1 ⁄ |¼B R|)?
No! Tuples satisfying (A = u) and (B = v) are counted twice
|Q| ¼ |R| · (1 – (1 – 1 ⁄ |¼A R|) · (1 – 1 ⁄ |¼B R|))
7/17/2015
Intuition: (A = u) or (B = v) is equivalent to
: ( : (A = u) AND : (B = v))
Luke Huan Univ. of Kansas
28
Cost estimation: summary
Using similar ideas, we can estimate the size of projection,
duplicate elimination, union, difference, aggregation (with
grouping)
Lots of assumptions and very rough estimation
Accurate estimate is not needed
Maybe okay if we overestimate or underestimate consistently
May lead to very nasty optimizer “hints”
SELECT * FROM Student WHERE GPA > 3.9;
SELECT * FROM Student WHERE GPA > 3.9 AND GPA >
3.9;
Not covered: better estimation using histograms
7/17/2015
Luke Huan Univ. of Kansas
29
Search for the best plan
!
Huge search space
“Bushy” plan example:
!
R2
R1 R3
Just considering different join orders, there are
R4
(2n – 2)! / (n – 1) bushy plans for R1 ! L ! Rn
!
!
R5
30240 for n = 6
And there are more if we consider:
Multiway joins
Different join methods
Placement of selection and projection operators
7/17/2015
Luke Huan Univ. of Kansas
30
Left-deep plans
!
!
R5
!
R4
!
R
3
R2
R1
Heuristic: consider only “left-deep” plans, in which only the left
child can be a join
Tend to be better than plans of other shapes, because many join
algorithms scan inner (right) relation multiple times—you will not
want it to be a complex subtree
How many left-deep plans are there for R1 ! L ! Rn?
Significantly fewer, but still lots— n! (720 for n = 6)
7/17/2015
Luke Huan Univ. of Kansas
31
A greedy algorithm
S1, …, Sn
Say selections have been pushed down; i.e., Si = ¾p Ri
Start with the pair Si, Sj with the smallest estimated size for Si !
Sj
Repeat until no relation is left:
Pick Sk from the remaining relations such that the join of Sk and
the current result yields an intermediate result of the smallest size
Pick most efficient join method
Minimize expected size
Current subplan
7/17/2015
!
Remaining
…, Sk, Sl, Sm, … relations
to be joined
Sk
Luke Huan Univ. of Kansas
32
A dynamic programming approach
Generate optimal plans bottom-up
Pass 1: Find the best single-table plans (for each table)
Pass 2: Find the best two-table plans (for each pair of tables) by
combining best single-table plans
…
Pass k: Find the best k-table plans (for each combination of k tables)
by combining two smaller best plans found in previous passes
…
Rationale: Any subplan of an optimal plan must also be optimal
(otherwise, just replace the subplan to get a better overall plan)
Well, not quite…
7/17/2015
Luke Huan Univ. of Kansas
33
The need for “interesting order”
Example: R(A, B) ! S(A, C) ! T(A, D)
Best plan for R ! S: hash join (beats sort-merge join)
Best overall plan: sort-merge join R and S, and then sort-merge
join with T
Subplan of the optimal plan is not optimal!
Why?
The result of the sort-merge join of R and S is sorted on A
This is an interesting order that can be exploited by later processing
(e.g., join, duplicate elimination, GROUP BY, ORDER BY, etc.)!
7/17/2015
Luke Huan Univ. of Kansas
34
Summary
Relational algebra equivalence
SQL rewrite tricks
Heuristics-based optimization
Cost-based optimization
Need statistics to estimate sizes of intermediate results
Greedy approach
Dynamic programming approach
7/17/2015
Luke Huan Univ. of Kansas
35