Transcript Ddb-L42
Distributed Query Optimization Algorithms
System R and R*
Hill Climbing and SDD-1
L4.2.2. Distributed Query Optimization Algorithms -- 1
System R (Centralized) Algorithm
Simple (one relation) queries are executed
according to the best access path.
Execute joins
Determine the possible ordering of joins
Determine the cost of each ordering
Choose the join ordering with the minimal cost
For joins, two join methods are considered:
Nested loops
Merge join
L4.2.2. Distributed Query Optimization Algorithms -- 2
System R Algorithm -- Example
Names of employees working on the CAD/CAM
project
Assume
EMP has an index on ENO,
ASG has an index on PNO,
PROJ has an index on PNO and an index on PNAME
L4.2.2. Distributed Query Optimization Algorithms -- 3
System R Algorithm -- Example
Choose the best access paths to each relation
EMP: sequential scan (no selection on EMP)
ASG: sequential scan (no selection on ASG)
PROJ: index on PNAME (there is a selection on PROJ based on
PNAME)
Determine the best join ordering
EMP
ASG
PROJ
ASG
PROJ
EMP
PROJ
ASG
EMP
ASG
EMP
PROJ
EMP PROJ
ASG
PROJ EMP
ASG
Select the best ordering based on the join costs evaluated according
to the two methods
L4.2.2. Distributed Query Optimization Algorithms -- 4
System R Example (cont'd)
EMP
EMP
ASG EMP × PROJ ASG
(ASG
PROJ
ASG
EMP ASG
EMP)
PROJ PROJ
PROJ
ASG PROJ × EMP
(PROJ
ASG)
ASG)
EMP
EMP
Best total join order is one of
(ASG
EMP)
PROJ
(PROJ
L4.2.2. Distributed Query Optimization Algorithms -- 5
System R Algorithm
(PROJ
ASG)
EMP has a useful index on the select
attribute and direct access to the join attributes of ASG and
EMP.
Final plan:
select PROJ using index on PNAME
then join with ASG using index on PNO
then join with EMP using index on ENO
L4.2.2. Distributed Query Optimization Algorithms -- 6
System R* Distributed Query Optimization
Total-cost minimization. Cost function includes
local processing as well as transmission.
Algorithm
For each relation in query tree find the best access path
For the join of n relations find the optimal join order
strategy
each local site optimizes the local query processing
L4.2.2. Distributed Query Optimization Algorithms -- 7
Data Transfer Strategies
Ship-whole. entire relation is shipped and stored as
temporary relation. If merge join algorithm is used,
no need for temporary storage, and can be done in
pipeline mode
Fetch-as-needed. this method is equivalent to
semijoin of the inner relation with the outer relation
tuple
L4.2.2. Distributed Query Optimization Algorithms -- 8
Join Strategy 1
External relation R with internal relation S, let LC
be local processing cost, CC be data transfer cost,
let average number of tuples of S that match one
tuple of R be s
Strategy 1. Ship the entire outer relation to the site
of internal relation
TC = LC(get R)
+ CC(size(R))
+ LC(get s tuples from S)*card(R)
L4.2.2. Distributed Query Optimization Algorithms -- 9
Join Strategy 2
Ship the entire inner relation to the site of the outer
relation
TC = LC(get S)
+ CC(size(S))
+ LC(store S)
+ LC(get R)
+ LC(get s tuples from S)*card(R)
L4.2.2. Distributed Query Optimization Algorithms -- 10
Join Strategy 3
Fetch tuples of the inner relation for each tuple of
the outer relation
TC = LC(get R)
+ CC(len(A)) * card(R)
+ LC(get s tuples from S) * card(R)
+ CC(s*len(S))*card(R)
L4.2.2. Distributed Query Optimization Algorithms -- 11
Join Strategy 4
Move both relations to 3rd site and join there
TC = LC(get R)
+ LC(get S)
+ CC(size(S))
+ LC(store S)
+ CC(size(R))
+ LC(get s tuples from S)*card(R)
Conceptually, the algorithm does an exhaustive
search among all alternatives and selects one that
minimizes total cost
L4.2.2. Distributed Query Optimization Algorithms -- 12
Hill Climbing Algorithm - Algorithm
Inputs
query graph, locations of relations, and relation statistics
Initial solution
the least costly among all when the relations are sent to a candidate
result site denoted by ES0, and the site as chosen site
Splits ES0 into
ES1: ship one relation of join to the site of other relation
ES2: these two relations are joined locally and the result is
transmitted to the chosen site
If cost(ES1) + cost(ES2) + LC > cost (ES0) select ES0,
else select ES1 and ES2.
The process can be recursively applied to ES1 and ES2 till no
more benefit occurs
L4.2.2. Distributed Query Optimization Algorithms -- 13
Hill Climbing Algorithm - Example
Relation Size Site
EMP
8
1
PAY
4
2
PROJ
1
3
ASG
10
4
SAL
PROJ
PAY
EMP
PNO
Site1
EMP(8)
Site2
PAY(4)
TITLE
ENO
PNAME=“CAD/CAM”
ES0
Cost = 13
4
Site3
PROJ(1)
1
8
Site4
ASG(10)
ASG
Ignore the local processing cost
Length of tuples is 1 for all relation
L4.2.2. Distributed Query Optimization Algorithms -- 14
ES1
HCA - Example
Solution 1
Cost =
ES0
Cost = 13
Site1
EMP(8)
Site2
PAY(4)
4
Site3
PROJ(1)
1
Site4
ASG(10)
Solution 2
Cost =
Site2
PAY(4
)
TITLE
Site3
PROJ(1)
8
ES1
Site2
PAY(4)
?
Site3
PROJ(1)
ES2
?
Site4
?
ASG(10)
ES3
Site1
EMP(8)
ES2
ES3
Site1
EMP(8)
Site4
ASG(10)
ESo is the
“BEST”
L4.2.2. Distributed Query Optimization Algorithms -- 15
Hill Climbing Algorithm - Comments
Greedy algorithm:
determines an initial
feasible solution and
iteratively tries to improve
it.
If there are local minimas,
it may not find the global
minima
If the optimal solution has
a high initial cost, it won’t
be found since it won’t be
chosen as the initial feasible
solution.
Site1
EMP(8)
Site2
PAY(4)
Site3
PROJ(1)
Site4
ASG(10)
COST =
L4.2.2. Distributed Query Optimization Algorithms -- 16
SDD-1 Algorithm
SDD-1 algorithm generalized the hill-climbing
algorithm to determine ordering of beneficial
semijoins; and uses statistics on the database, called
database profiles.
Cost of semijoin:
Cost (R SJA S) = CMSG + CTR*size(A(S))
Benefit is the cost of transferring irrelevant tuple
Benefit(R SJA S) = (1-SFSJ(S.A)) * size(R) * CTR
A semijoin is beneficial if cost < benefit.
L4.2.2. Distributed Query Optimization Algorithms -- 17
SDD-1: The Algorithm
initialization phase generates all beneficial
semijoins, and an execution strategy that includes
only local processing
most beneficial semijoin is selected; statistics are
modified and new beneficial semijoins are selected
the above step is done until no more beneficial joins
are left
assembly site selection to perform local operations
postoptimization removes unnecessary semijoins
L4.2.2. Distributed Query Optimization Algorithms -- 18
SDD1 - Example
SELECT
FROM
WHERE
AND
*
Relation Card Tup_Len Rel_size
EMP, ASG, PROJ
30
50
1500
EMP.ENO = ASG.ENO EMP
ASG
100
30
3000
ASG.PNO = PROJ.PNO
PROJ
Site 2
ASG
ENO
Site 1
EMP
PNO
Site 3
PROJ
50
40
2000
Relation SFsj Size(PJ(attr))
EMP.ENO 0.3
120
ASG.ENO 0.8
400
ASG.PNO 1.0
400
PROJ.PNO 0.4
200
L4.2.2. Distributed Query Optimization Algorithms -- 19
SDD1 - First Iteration
SJ1: ASG SJ EMP
benefit = (1-0.3)*3000 = 2100;
cost = 120
SJ2: ASG SJ PROJ
benefit = (1-0.4)*3000 = 1800
cost = 200
SJ3: EMP SJ ASG
benefit = (1-0.8)*1500 = 300;
cost = 400
SJ4: PROJ SJ ASG
benefit = 0;
cost = 400
SJ1 is selected
ASG size is reduced to
3000*0.3=900
ASG’ = ASG SJ EMP
Semijoin selectivity
factor is reduced; it is
approximated by
SFSJ(G’.ENO)= 0.8*0.3 =
0.24,
SFSJ(G’PNO)=1.0*0.3
=0.3, size(G’.ENO)=
400*0.3=120,
size(G’.PNO) = 120
L4.2.2. Distributed Query Optimization Algorithms -- 20
SDD-1 - Second & Third Iterations
Second iteration
SJ2: ASG’ SJ PROJ benefit=(10.4)*900=540
cost=200;
SJ3: EMP SJ ASG’; benefit=(10.24)*1500=1140
cost=120
SJ4: PROJ SJ ASG’, benefit=(10.3)*2000=1400
cost=120
SJ4 is selected
PROJ’ = PROJ SJ ASG’
size(PROJ’) = 2000*0.3 = 600
SFSJ(J’)=0.4*0.3=0.12
size(J’.PNO)=200*0.3=60
Third Iteration
SJ2: ASG’ SJ PROJ
benefit=(1-0.12)*900=792
cost=60;
SJ3: EMP SJ ASG’; benefit=(10.24)*1500=1140
cost=120
SJ3 is selected
reduces size of E to
1500*0.24=360
Finally SJ2 is selected, with size
of G as 108
L4.2.2. Distributed Query Optimization Algorithms -- 21
Local Optimization
Each site optimizes the plan to be executed at the
site
A centralized query optimization problem
L4.2.2. Distributed Query Optimization Algorithms -- 22
SDD-1 - Assembly Site Selection
After reduction
EMP is at site 1 with size 360
ASG is at site 2 with size 108
PROJ is at site 3 with size 600
Site 3 is chosen as assembly site
Site1
EMP
SJ4 is removed in post
optimization.
(ASG SJ EMP) SJ PROJ site 3
(EMP SJ ASG) site 3
join at site 3
Site3
PROJ
Site2
ASG
L4.2.2. Distributed Query Optimization Algorithms -- 23