Document 7835885

Download Report

Transcript Document 7835885

Schema Refinement and Normalization
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
September 25, 2003
Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan
Administrivia




HW2 due now
HW3 handed out
Readings on XML: papers rather than textbook
Will shortly have readings for each project group
 Q: Dinkar’s office hours conflict with 505 on
Monday. Would you prefer them on Friday?
2
Schema Refinement & Design Theory
 ER Diagrams give us a start in logical schema design
 Sometimes need to refine our designs further
 There’s a system and theory for this
 Focus is on redundancy of data
 Causes update, insertion, deletion anomalies
3
Not All Designs are Equally Good
Why is this a poor schema design?
Stuff(sid, name, serno, subj, cid, exp-grade)
And why is this one better?
Student(sid, name)
Course(serno, cid)
Subject(cid, subj)
Takes(sid, serno, exp-grade)
4
Focus on the Bad Design
sid name serno subj cid exp-grade
1 Sam 570103 AI 570
B
23 Nitin 550103 DB 550
A
45
1
Jill
Sam
505103 OS 505
505103 OS 505
A
C
 Certain items (e.g., name) get repeated
 Some information requires that a student be enrolled
(e.g., courses) due to the key
5
Functional Dependencies
Describe “Key-Like” Relationships
A key is a set of attributes where:
If keys match, then the tuples match
A functional dependency (FD) is a generalization:
If an attribute set determines another, written A ! B
then if two tuples agree on attribute set A, they must
agree on B:
sid ! name
What other FDs are there in this data?
 FDs are independent of our schema design choice
6
Formal Definition of FD’s
Def. Given a relation schema R and subsets X, Y of R:
An instance r of R satisfies FD X  Y if,
for any two tuples t1, t2 2 r,
t1[X ] = t2[X] implies t1[Y] = t2[Y]
 For an FD to hold for schema R, it must hold for
every possible instance of r
(Can a DBMS verify this? Can we determine this by looking
at an instance?)
7
General Thoughts on Good Schemas
We want all attributes in every tuple to be determined
by the tuple’s key attributes, i.e. part of a superkey
(for key X  Y, a superkey is a “non-minimal” X)
What does this say about redundancy?
But:
 What about tuples that don’t have keys (other than the entire
value)?
 What about the fact that every attribute determines itself?
8
Armstrong’s Axioms: Inferring FDs
Some FDs exist due to others; can compute using
Armstrong’s axioms:
 Reflexivity:
If Y  X then X  Y
name, sid  name
(trivial dependencies)
 Augmentation: If X  Y then XW  YW
serno  subj so serno, exp-grade  subj, exp-grade
 Transitivity:
If X  Y and Y  Z then X  Z
serno  cid and cid  subj
so serno  subj
9
Armstrong’s Axioms Lead to…
If X  Y and X  Z
then X  YZ
 Pseudotransitivity: If X  Y and WY  Z
then XW  Z
 Decomposition: If X  Y and Z  Y
then X  Z
 Union:
Let’s prove these from Armstrong’s Axioms
10
Closure of a Set of FD’s
Defn. Let F be a set of FD’s.
Its closure, F+,is the set of all FD’s:
{X  Y | X  Y is derivable from F by Armstrong’s
Axioms}
Which of the following are in the closure of our Student-Course
FD’s?
name  name
cid  subj
serno  subj
cid, sid  subj
cid  sid
11
Attribute Closures: Is Something
Dependent on X?
Defn. The closure of an attribute set X, X+, is:
X+ =  {Y | X  Y  F +}
 This answers the question “is Y determined
(transitively) by X?”; compute X+ by:
closure := X;
repeat until no change {
if there is an FD U  V in F
such that U is in closure
then add V to closure}
 Does sid, serno  subj, exp-grade?
12
Equivalence of FD sets
Defn. Two sets of FD’s, F and G, are equivalent if
their closures are equivalent, F + = G +
e.g., these two sets are equivalent:
{XY  Z, X  Y} and
{X  Z, X  Y}
 F + contains a huge number of FD’s
(exponential in the size of the scheme)
 Would like to have smallest “representative” FD
set
13
Minimal Cover
we express
Defn. A FD set F is minimal if:
each FD in
1. Every FD in F is of the form X  A,
simplest form
where A is a single attribute
2. For no X  A in F is:
in a sense,
F – {X  A } equivalent to F
each FD is
“essential”
3. For no X  A in F and Z  X is:
to the cover
F – {X  A }  {Z  A } equivalent to F
Defn. F is a minimum cover for G if F is minimal and is
equivalent to G.
e.g.,
{X  Z, X  Y} is a minimal cover for
{XY  Z, X  Y}
14
More on Closures
If F is a set of FD’s and X  Y  F +
then for some attribute A  Y, X  A  F
+
Proof by counterexample.
Assume otherwise and let Y = {A1,..., An}
Since we assume X  A1, ..., X  An are in F +
then X  A1 ... An is in F + by union rule,
hence, X  Y is in F + which is a contradiction
15
Why Armstrong’s Axioms?
Why are Armstrong’s axioms (or an equivalent rule
set) appropriate for FD’s? They are:
 Consistent: any relation satisfying FD’s in F will satisfy
those in F +
 Complete: if an FD X  Y cannot be derived by
Armstrong’s axioms from F, then there exists some
relational instance satisfying F but not
XY
 In other words, Armstrong’s axioms derive all the
FD’s that should hold
16
Proving Consistency
We prove that the axioms’ definitions must be true
for any instance, e.g.:
 For augmentation (if X  Y then XW  YW):
If an instance satisfies X  Y, then:
 For any tuples t1, t2 r,
if t1[X] = t2[X] then t1[Y] = t2[Y]
by defn.
 If, additionally, it is given that t1[W] = t2[W],
then t1[YW] = t2[YW]
17
Proving Completeness
Suppose X  Y  F + and define a relational instance
r that satisfies F + but not X  Y:
 Then for some attribute A  Y, X  A  F +
 Let some pair of tuples in r agree on X+ but disagree
everywhere else:
X
A
X+
–X
R – X+ – {A}
x1 x2 ... xn a1,1 v1 v2 ... vm w1,1 w2,1...
x1 x2 ... xn a1,2 v1 v2 ... vm w1,2 w2,2...
18
Proof of Completeness cont’d
 Clearly this relation fails to satisfy X  A and X  Y.
We also have to check that it satisfies any FD in F + .
 The tuples agree on only X + .
Thus the only FD’s that might be violated are of the form
X’  Y’ where X’  X+ and Y’ contains attributes in
R – X+ – {A}.
 But if X’  Y’ F+ and X’  X+ then Y’  X+ (reflexivity
and augmentation).
Therefore X’  Y’ is satisfied.
19
Decomposition
 Consider our original “bad” attribute set
Stuff(sid, name, serno, subj, cid, exp-grade)
 We could decompose it into
Student(sid, name)
Course(serno, cid)
Subject(cid, subj)
 But this decomposition loses information about
the relationship between students and courses.
Why?
20
Lossless Join Decomposition
R1, … Rk is a lossless join decomposition of R w.r.t. an FD set F if
for every instance r of R that satisfies F,
R1(r) ⋈ ... ⋈ Rk(r) = r
Consider:
sid
name
serno
subj
cid
exp-grade
1
23
Sam
Nitin
570103
550103
AI
DB
570
550
B
A
What if we decompose on
(sid, name) and (serno, subj, cid, exp-grade)?
21
Testing for Lossless Join
R1, R2 is a lossless join decomposition of R with respect to F
iff at least one of the following dependencies is in F+
(R1  R2)  R1 – R2
(R1  R2)  R2 – R1
So for the FD set:
sid  name
serno  cid, exp-grade
cid  subj
Is (sid, name) and (serno, subj, cid, exp-grade) a lossless
decomposition?
22
Dependency Preservation
Ensures we can “easily” check whether a FD X  Y
is violated during an update to a database:
 The projection of an FD set F onto a set of attributes Z,
FZ is
{X  Y | X  Y  F +, X  Y  Z}
i.e., it is those FDs local to Z’s attributes
 A decomposition R1, …, Rk is dependency preserving if
F + = (FR1 ... FRk)+
The decomposition hasn’t “lost” any essential FD’s, so we
can check without doing a join
23
Example of Lossless and
Dependency-Preserving Decompositions
Given relation scheme
R(name, street, city, st, zip, item, price)
And FD set name  street, city
street, city  st
street, city  zip
name, item  price
Consider the decomposition
R1(name, street, city, st, zip) and R2(name, item, price)
 Is it lossless?
 Is it dependency preserving?
What if we replaced the first FD by name, street  city?
24
Another Example
Given scheme: R(sid, fid, subj)
and FD set: fid  subj
sid, subj  fid
Consider the decomposition
R1(sid, fid) and R2(fid, subj)
 Is it lossless?
 Is it dependency preserving?
25
FD’s and Keys
 Ideally, we want a design s.t. for each nontrivial
dependency X  Y, X is a superkey for some
relation schema in R
 We just saw that this isn’t always possible
 Hence we have two kinds of normal forms
26
Two Important Normal Forms
Boyce-Codd Normal Form (BCNF). For every relation
scheme R and for every X  A that holds over R,
either A  X (it is trivial) ,or
or X is a superkey for R
Third Normal Form (3NF). For every relation scheme
R and for every X  A that holds over R,
either A  X (it is trivial), or
X is a superkey for R, or
A is a member of some key for R
27
Normal Forms Compared
 BCNF is preferable, but sometimes in conflict with
the goal of dependency preservation
 It’s strictly stronger than 3NF
 Let’s see algorithms to obtain:
 A BCNF lossless join decomposition
 A 3NF lossless join, dependency preserving decomposition
28
BCNF Decomposition Algorithm
(from Korth et al.; our book gives recursive version)
result := {R}
compute F+
while there is a schema Ri in result that is not in BCNF
{
let A  B be a nontrivial FD on Ri
s.t. A  Ri is not in F+
and A and B are disjoint
result:= (result – Ri)  {(Ri - B), (A,B)}
}
29
3NF Decomposition Algorithm
Let F be a minimal cover
i:=0
for each FD A  B in F {
if none of the schemas Rj, 1 j  i, contains AB
{
increment i
Ri := (A, B)
}
}
if no schema Rj, 1  j  i contains a candidate key for R {
increment i
Ri := any candidate key for R
}
return (R1, …, Ri)
Build dep.preserving
decomp.
Ensure
lossless
decomp.
30
Summary
 We can always decompose into 3NF and get:
 Lossless join
 Dependency preservation
 But with BCNF we are only guaranteed lossless joins
 BCNF is stronger than 3NF: every BCNF schema is
also in 3NF
 The BCNF algorithm is nondeterministic, so there is
not a unique decomposition for a given schema R
31