Schema Refinement and Normal Forms

Download Report

Transcript Schema Refinement and Normal Forms

Schema Refinement and
Normal Forms
Chapter 19
Raghu Ramakrishnan and J. Gehrke
(second text book)
In Course Pick-up box tomorrow
1
Review
Avoiding the expense of global integrity constraints
e.g.: lno  bname preserved by:
CREATE ASSERTION lno-bname
CHECK ( NOT EXIST
(SELECT *
FROM loan-info l1, loan-info l2
WHERE l1.lno = l2.lno AND
l1.bname <> l2.bname))
Expensive, requires a join for every insertion
Reducing the expense:
1. Determine FD set for loan-info, F
2. Find “minimal set”, G, s.t. F= G
2
Functional Dependencies (FDs)

A functional dependency X  Y holds over relation R
if, for every allowable instance r of R:



An FD is a statement about all allowable relations.



t1  r, t2  r,  X (t1) =  X (t2) implies  Y (t1) =  Y (t2)
i.e., given two tuples in r, if the X values agree, then the Y
values must also agree. (X and Y are sets of attributes.)
Must be identified based on semantics of application.
Given some allowable instance r1 of R, we can check if it
violates some FD f, but we cannot tell if f holds over R!
K is a candidate key for R means that K  R

However, K  R does not require K to be minimal!
3
Reasoning About FDs

Given some FDs, we can usually infer additional FDs:


An FD f is implied by a set of FDs F if f holds
whenever all FDs in F hold.


F  = closure of F is the set of all FDs that are implied by F.
Armstrong’s Axioms (X, Y, Z are sets of attributes):
 Reflexivity: If X  Y, then Y  X



ssn  did, did  lot implies ssn  lot
Augmentation: If X  Y, then XZ  YZ for any Z
Transitivity: If X  Y and Y  Z, then X  Z
These are sound and complete inference rules for FDs!
4
Armstrong’s Axioms
A. Fundamental Rules (W, X, Y, Z: sets of attributes)
1. Reflexivity
If Y X then X  Y
2. Augmentation
If X  Y then WX  WY
3. Transitivity
If X Y and Y  Z then XZ
B. Additional rules (can be proved from A)
4. UNION: If X  Y and X  Z then X  YZ
5. Decomposition: If X  YZ then X  Y, X Z
6. Pseudotransitivity: If X  Y and WY  Z then
WX Z
5
Functional Dependencies
A
a
a
a
b
b
B





C
1
1
5
3
3
D
U
V
W
W
W
ABC
“ AB determines C”
two tuples with the same values for A and B
will also have the same value for C
6
Functional Dependencies (Cont.)


K is a superkey for relation schema R if and only if
KR
K is a candidate key for R if and only if
 K  R, and
 for no   K,   R

Functional dependencies allow us to express
constraints that cannot be expressed using superkeys.
Consider the schema:
bor_loan = (customer_id, loan_number, amount ).
We expect this functional dependency to hold:
loan_number  amount
but would not expect the following to hold:
amount  customer_name
7
Use of Functional Dependencies

We use functional dependencies to:
 test relations to see if they are legal under a given set
of functional dependencies.
• If a relation r is legal under a set F of functional
dependencies, we say that r satisfies F.
 specify constraints on the set of legal relations
• We say that F holds on R if all legal relations on R
satisfy the set of functional dependencies F.

Note: A specific instance of a relation schema
may satisfy a functional dependency even if the
functional dependency does not hold on all
legal instances.
 For example, a specific instance of loan may, by
8
Functional Dependencies (Cont.)

A functional dependency is trivial if it is
satisfied by all instances of a relation
 Example:
• customer_name, loan_number  customer_name
• customer_name  customer_name
 In general,    is trivial if   
9
Another use of FD’s: Schema Design
Example:
R=
bname
Downtown
Downtown
Mianus
Downtown
bcity
Bkln
Bkln
Horse
Bkln
assets
9M
9M
1.7M
9M
cname
Jones
Johnson
Jones
Hayes
lno
L-17
L-23
L-93
L-17
amt
1000
2000
500
1000
R: “Universal relation”
tuple meaning: Jones has a loan (L-17) for $1000 taken out at the Downtown
branch in Bkln which has assets of $9M
Design:
+:
-:
fast queries (no need for joins!)
redudancy:
update anomalies
examples?
deletion anomalies
10
The Evils of Redundancy

A first-cut design
 Universal relation (or few relations)
 Redundantly store some columns in multiple tables

Redundancy is at the root of several problems
associated with relational schemas:


redundant storage, performance (of updates,…) suffers
insert/delete/update anomalies
11
Example: Constraints on Entity Set

Consider relation obtained from Hourly_Emps:


Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)
Notation: We will denote this relation schema by
listing the attributes: SNLRWH


This is really the set of attributes {S,N,L,R,W,H}.
Sometimes, we will refer to all attributes of a relation by
using the relation name. (e.g., Hourly_Emps for SNLRWH)
12
Example (Contd.)
Hourly_Emps relation
S



N
L
R W H
123-22-3666 Attishoo
48 8
10 40
231-31-5368 Smiley
22 8
10 30
131-24-3650 Smethurst 35 5
7
30
434-26-3751 Guldu
35 5
7
32
612-67-4134 Madayan
35 8
10 40
No Fuctional Dependencies
Any instance is legal here (no constraints)
No redundany
13
Example: Constraints on Entity Set
Hourly_Emps relation
S

N
L
R W H
123-22-3666 Attishoo
48 8
10 40
231-31-5368 Smiley
22 8
10 30
131-24-3650 Smethurst 35 5
7
30
434-26-3751 Guldu
35 5
7
32
612-67-4134 Madayan
35 8
10 40
Add Some FDs on Hourly_Emps:


ssn is the key: S  SNLRWH
• Values of SSN have to be unique in the relation
rating determines hrly_wages: R  W
• For two rows that have same value of R, the rows also
have same value for W
• Above relation stores R and W values “redundantly”
14
Example (Contd.)

Hourly_Emps

Problems due to R
W:
 Update anomaly: Can
we change W in just
the 1st tuple of SNLRWH?
 Insertion anomaly: What if
we want to insert an
employee and don’t know
the hourly wage for his
rating?
 Deletion anomaly: If we
delete all employees with
rating 5, we lose the
information about the
wage for rating 5!
S
N
L
R W H
123-22-3666 Attishoo
48 8
10 40
231-31-5368 Smiley
22 8
10 30
131-24-3650 Smethurst 35 5
7
30
434-26-3751 Guldu
35 5
7
32
612-67-4134 Madayan
35 8
10 40
Will 2 smaller tables be better?
15
Example (Contd.)
Will 2 smaller tables be better? Yes
Hourly_Emps2
S
N
Wages
L
R H
123-22-3666 Attishoo
48 8 40
231-31-5368 Smiley
22 8 30
R W
8 10
5 7
131-24-3650 Smethurst 35 5 30
434-26-3751 Guldu
35 5 32
612-67-4134 Madayan
35 8 40
Solution to avoid redundancy: Decomposition to smaller tables
But:
what criteria should the new ‘smaller’ tables satisfy so that
you can stop decomposition? What is a good design?
---- Normal Forms
16
Goals of Decomposition
1. Lossless Joins
Want to be able to reconstruct big (e.g. universal) relation by
joining smaller ones (using natural joins)
(i.e. R1
R2 = R)
2. Dependency preservation
Want to minimize the cost of global integrity constraints based on FD’s
( i.e. avoid big joins in assertions)
3. Redundancy Avoidance/Minimization
Avoid/minimize unnecessary data duplication (the motivation for decomposition)
 Normal Forms
Why important?
LJ : information loss
DP: efficiency (time)
RA: efficiency (space), update anomalies
17
Decomposition
1. Decomposing the schema
R = ( bname, bcity, assets, cname, lno, amt)
R1 = (bname, bcity, assets, cname)
2. Decomposing the instance
bname
Downtown
Downtown
Mianus
Downtown
bcity
Bkln
Bkln
Horse
Bkln
assets
9M
9M
1.7M
9M
bname
Downtown
Downtown
Mianus
Downtown
cname
Jones
Johnson
Jones
Hayes
R = R1 U R2
R1 = (cname, lno, amt)
bcity
Bkln
Bkln
Horse
Bkln
assets
9M
9M
1.7M
9M
cname
Jones
Johnson
Jones
Hayes
cname
Jones
Johnson
Jones
Hayes
lno
L-17
L-23
L-93
L-17
lno
L-17
L-23
L-93
L-17
amt
1000
2000
500
1000
amt
1000
2000
500
1000
18
Dependency Goal #1: lossless joins
A bad decomposition:
bname
Downtown
Downtown
Mianus
Downtown
bname
Downtown
Downtown
Mianus
Downtown
bcity
Bkln
Bkln
Horse
Bkln
bcity
Bkln
Bkln
Horse
Bkln
assets
9M
9M
1.7M
9M
assets
9M
9M
1.7M
9M
cname
Jones
Johnson
Jones
Hayes
cname
Jones
Johnson
Jones
Hayes
cname
Jones
Johnson
Jones
Hayes
lno
L-17
L-23
L-93
L-17
lno
L-17
L-23
L-93
L-17
amt
1000
2000
500
1000
amt
1000
2000
500
1000
=
bname
Downtown
Downtown
Downtown
Mianus
Mianus
Downtown
bcity
Bkln
Bkln
Bkln
Horse
Horse
Bkln
assets
9M
9M
9M
1.7M
1.7M
9M
cname
Jones
Jones
Johnson
Jones
Jones
Hayes
lno
L-17
L-93
L-23
L-17
L-93
L-17
amt
1000
500
2000
1000
500
1000
Problem:
join adds
meaningless tuples
“lossy join”: by
adding noise, have
lost meaningful
information
19
Dependency Goal #1: lossless joins
Is the following decomposition lossless or lossy?
bname
Downtown
Downtown
Mianus
Downtown
assets
9M
9M
1.7M
9M
Ans: Lossless:
cname
Jones
Johnson
Jones
Hayes
lno
L-17
L-23
L-93
L-17
R = R1
lno
L-17
L-23
L-93
bcity
Bkln
Bkln
Horse
amt
1000
2000
500
R2, it has same 4 tuples as original
R1 and R2 share the lno which is the key to R2.
20
Ensuring Lossless Joins
A decomposition of R : R = R1 union R2
Is lossless iff
R1 R2  R1, or
R1  R2  R2
(i.e., intersecting attributes must for a superkey for one
of the resulting smaller relations)
In the previous example, lno is the common attribute
and lno is the key to second relation R2
21
More on Lossless Join
A
 The decomposition of R into X
1
and Y is lossless-join wrt F if and 4
only if the closure of F contains: 7
 X  Y  X, or
 X Y  Y

In particular, the decomposition of
R into R - V and UV is lossless-join
if U  V holds over R (i.e., U is
key of second relation)
B
2
5
2
C
3
6
8
A
1
4
7
1
7
B
2
5
2
2
2
C
3
6
8
8
3
A
1
4
7
B
2
5
2
B
2
5
2
C
3
6
8
22
Decomposition Goal #2: Dependency
preservation
Goal: efficient integrity checks of FD’s
An example w/ no DP:
R = ( bname, bcity, assets, cname, lno, amt)
bname  bcity assets
lno  amt bname
Decomposition: R = R1 U R2
R1 = (bname, assets, cname, lno)
R2 = (lno, bcity, amt)
Lossless but not DP. Why?
Ans: bname bcity assets crosses 2 tables
Goal #2: Attributes in a dependency should be in a single relation
23
Decomposition Goal #3: Redundancy
Avoidance/Minimization
Redundancy
Example:
A
a
e
g
h
m
n
p
B
x
x
y
y
y
z
z
C
1
1
2
2
2
1
1
(1) An FD that exists in the above relation is:
for B=x , y and z
BC
(2) A superkey in the above relation is A, (or any set containing A)
When do you have redundancy?
Ans: when there is some FD, XY covered by a relation
and X is not a superkey
Criteria to determine if a design avoids/minimizes redundancy
 Normal Forms
24
Normal Forms
Criteria to decide whether or not a design is good
 Minimize/avoid the problems due to redundancy
 Examples






First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
More restrictive
Most important
 BCNF and 3NF
25
Boyce-Codd Normal Form (BCNF)

Reln R with FDs F is in BCNF if, for all X  A in F 



A  X (called a trivial FD), or
X is a super key for R (i.e., X contains a candidate key for R)
In other words, R is in BCNF if the only non-trivial
FDs that hold over R are key constraints.



No dependency in R that can be predicted using FDs alone.
If we are shown two tuples that agree upon
X Y A
the X value, we cannot infer the A value in
one tuple from the A value in the other.
x y1 a
If example relation is in BCNF, the 2 tuples x y2 ?
must be identical (since X is a key).
26
Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a set F of
functional dependencies if for all functional dependencies in F+ of
the form

where   R and   R, at least one of the following holds:


   is trivial (i.e.,   )
 is a superkey for R
Example schema not in BCNF:
bor_loan = ( customer_id, loan_number, amount )
because loan_number  amount holds on bor_loan but loan_number is
not a superkey
27
Decomposing a Schema into BCNF

Suppose we have a schema R and a non-trivial
dependency   causes a violation of BCNF.
We decompose R into:
( U  )
(R-(-))
In our example that is not in BCNF,
•
•

bor_loan = ( customer_id, loan_number, amount )
with FD loan_number  amount
  = loan_number
  = amount
and bor_loan is replaced by a BCNF
 ( U  ) = ( loan_number, amount )
 ( R - (  -  ) ) = ( customer_id, loan_number )
28
BCNF and Dependency Preservation



Constraints, including functional dependencies, are
costly to check in practice unless they pertain to only
one relation
If it is sufficient to test only those dependencies on
each individual relation of a decomposition in order
to ensure that all functional dependencies hold, then
that decomposition is dependency preserving.
Because it is not always possible to achieve both
BCNF and dependency preservation, we consider a
weaker normal form, known as third normal form.
29
Third Normal Form (3NF)

Reln R with FDs F is in 3NF if, for all X  A in
1. A  X (called a trivial FD), or
2.
3.


X contains a key for R, or
A is part of some key(i.e., candidate key) for R.
Minimality of a key is crucial in third condition above!


F
i.e., Can only have candidate keys
If R is in BCNF, obviously in 3NF.
If R is in 3NF, some redundancy is possible.



(why? Because (3) allows non-key based “X->A” dependencies)
It is a compromise, used when BCNF not achievable (e.g., no ``good’’
decomp, or performance considerations).
Lossless-join, dependency-preserving decomposition of R into a collection of
3NF relations always possible.
30
3NF example
R = (custid, empid, bname, type)
F = {empid  bname}
Other candidate keys {custid, bname}
 Not in BCNF because empid is not a
superkey
 In 3NF because bname is in a candidate key

31
3NF Decomposition Algorithm
Let Fc be a canonical cover for F;
i := 0;
for each functional dependency    in Fc do
if none of the schemas Rj, 1  j  i contains  
then begin
i := i + 1;
Ri :=  
end
if none of the schemas Rj, 1  j  i contains a candidate
key for R
then begin
i := i + 1;
Ri := any candidate key for R;
end
return (R1, R2, ..., Ri)
32
3NF Decomposition Algorithm
(Cont.)

Above algorithm ensures:
 each relation schema Ri is in 3NF
 decomposition is dependency preserving and
lossless-join
33
3NF Decomposition: An Example

Relation schema:
cust_banker_branch = (customer_id, employee_id, branch_name, type )

The functional dependencies for this relation schema
are:
1. customer_id, employee_id  branch_name, type
2. employee_id  branch_name
3. customer_id, branch_name  employee_id

We first compute a canonical cover (or minimal cover) FC
 branch_name is extraneous in the r.h.s. of the 1st dependency
 No other attribute is extraneous, so we get FC =
customer_id, employee_id  type
employee_id  branch_name
customer_id, branch_name  employee_id
34
3NF Decompsition Example (Cont.)

The for loop generates following 3NF schema:
(customer_id, employee_id, type )
(employee_id, branch_name)
(customer_id, branch_name, employee_id)
 Observe that (customer_id, employee_id, type ) contains a candidate key
of the original schema, so no further relation schema needs be added
35
Summary of Schema Refinement
If a relation is in BCNF, it is free of redundancies that
can be detected using FDs. Thus, trying to ensure
that all relations are in BCNF is a good heuristic.
 If a relation is not in BCNF, we can try to decompose
it into a collection of BCNF relations.



Must consider whether all FDs are preserved. If a losslessjoin, dependency preserving decomposition into BCNF is
not possible (or unsuitable, given typical queries), should
consider decomposition into 3NF.
Decompositions should be carried out and/or re-examined
while keeping performance requirements in mind.
36
Theory and practice
Performance tuning:
Redundancy not the sole guide to decomposition
Workload matters too!!
• nature of queries run
• mix of updates, queries
•.....
Workload (mix of queries, updates,…) can influence:
BCNF vs 3NF
may further decompose a BCNF into (4NF)
may denormalize (i.e., undo a decomposition or add new columns
 For optimizing query workloads: Materialized views,…
37