Transcript ppt



CS 728
Advanced Database Systems
Chapter 10
Functional Dependencies & Normalization

10.1




Design Guidelines for Relation Schemas

1. Attributes should have clear meanings
(semantics) and related attributes are grouped
into single entities.

2. Avoid update anomalies by reducing redundant
data.

3. Reduce the NULL values.

4. When relations are joined no spurious tuples
will be generated.
10.2


Semantics of Relation Attributes

Guideline 1:
Design a relation schema so that it is easy to
explain its meaning. Do not combine attributes
from multiple entity types and relationship types
into a single relation.
Design I:

STUDENT(STNO, Name, Address, ANO)
ADVISOR(ANO, Name, Address, Dept)
Design II:

Student-Advisor(STNO, Name, Address, ANO,
A-name, A-address, Dept)
Design I is better when compared with Design II.




10.3


Update Anomalies



Insertion anomalies:
 In design II, if we add a new student to
Student-Advisor, we have to add data related
to that student’s advisor. This information
should be consistent with all other occurrences
of that advisor. Note that in design II all data
related to a particular advisor is repeated a
number of times which equals to the number
of students supervised by that advisor. In
design I, only advisor number is repeated.
 It is difficult to add a new advisor who have
no students yet to the database. This is
because we have to assign nulls to STNO and
STNO is the primary key for Student-Advisor.
10.4



Update Anomalies


Deletion Anomalies
 If we delete the last student associated with a
particular advisor, then that advisor cannot
exist in the database in design II any more.

Modification Anomalies
 If an advisor changes his/her address, say,
then we have to modify his/her address in all
tuples.
 If we miss some tuples, then we will have
several addresses for the same advisor.
10.5


Update Anomalies





Guideline 2:
Design the base relation schemas so that no
insertion, deletion, or modification anomalies
occur.
If any anomalies are present, note them clearly so
that the programs that update the database will
operate correctly.
10.6


NULL values in Tuples





Guideline 3:
Avoid placing attributes in a base relation whose
values may be null.
If nulls are unavoidable, make sure that they
apply in exceptional cases only and do not apply
to majority of tuples in the relation.
10.7


NULL values in Tuples



Problems with Nulls:
 Waste storage space.

Have multiple interpretations (not-applicable,
not-known,…).

Create ambiguities with aggregate functions
(count, avg, …)

Create ambiguities with joins.
10.8



NULL values in Tuples


Example:
 If only 10% of employees have phones, then
Employee(SSN, Name,…., Office-phone) is a
poor design, because 90% of the last column
values will be nulls;

But:
 Employee(SSN, Name, …)
 Phone(SSN, Office-phone)
is a better design.
10.9



Spurious Tuples

Guideline 4:

Design relation schemas so that they can be
joined with equality conditions on attributes that
are either primary keys or foreign keys in a way
which guarantees that no spurious tuples are
generated.
10.10



Spurious Tuples



Emp-Proj(SSN, Pno, Hours, Ename, Pname,
Plocation)
SSN
Pno
Hours
Ename
11
p1
20
Alex
X
Irbid
22
p1
20
John
X
Irbid
22
p2
25
John
Y
Amman
33
p2
25
Bruce
Y
Amman
33
p3
30
Bruce
Z
Amman
10.11
Pname Plocatoin


Spurious Tuples




Emp-locs(Ename, Plocation)
Emp-Proj1(SSN, Pno, hours, Pname, Plocation)
Ename
Plocation
Alex
Irbid
John
Irbid
John
Amman
Bruce
Amman
10.12



Spurious Tuples
Emp-Proj(SSN, Pno, Hours, Pname, Plocation)

SSN
Pno
Hours
11
p1
20
X
Irbid
22
p1
20
X
Irbid
22
p2
25
Y
Amman
33
p2
25
Y
Amman
33
p3
30
Z
Amman
10.13
Pname Plocatoin


Spurious Tuples


Result  Emp-Locs  Emp-proj1
Then: Result(Ename, Plocation, SSN, Pno,
Hours, Pname)
Ename Plocation


SSN
Pno
Hours
Pname
11
p1
20
X
p1
20
X
Alex
Irbid
Alex
Irbid
John
Irbid
11
p1
20
X
John
Irbid
22
p1
20
X
22
10.14


Spurious Tuples



If we combine Emp-Locs and Emp_Proj1 on
Plocation attribute, we will get spurious tuples as
you have noticed in the previous slide. This is
because Plocation is the attribute which combines
the two relations and it is neither a primary key
nor a foreign key in either Emp-Locs or EmpProj1
10.15


Bad Tables (1)

Alternative designs for a product database:

(1)



Products(Prod_no, Prod_name, Price,
Manu_id)
Manufacturers (Manu_id, Manu_name,
Address)
(2)



Prod_Manu (Prod_no, Prod_name, Price,
Manu_id, Manu_name, Address)
10.16



Bad Tables (2)

Problems with the second design:

Redundancy --- the name and address of each
manufacturer will be repeated once for each
product made by the manufacturer.
 more storage space needed
 potential inconsistency (update anomalies)
10.17



Bad Tables (3)

Insertion anomalies --- a manufacturer's name and
address cannot be recorded in the database if it
does not make at least one product (because
Prod_no is part of the primary key).

Deletion anomalies --- If we delete all products
made by a manufacturer, we will unintentionally
lose track of the manufacturer's name and
address.

The first design does not have similar problems.
The challenge is to identify bad relations and
convert them into good relations.



10.18



Functional Dependencies

R: Relation schema.

r(R): Relation instance

A1, A2, …, An: Attributes which belong to
universal relation.

X, Y: Sets of attributes.
10.19



Functional Dependencies






A functional dependency (X  Y), between X
and Y specifies a constraint on the possible tuples
that can form a relation instance r of R.
The constraint states that for any two tuples t1
and t2 in r such that t1[X] = t2[X], then t1[Y] =
t2[Y]
This means that the Y component of a tuple in r
depends on (or determined by) the values of the X
component of that tuple in r.
Or the values of the X component of a tuple in r
uniquely or functionally determines the values of
Y component.
10.20



Functional Dependencies


X is said to functionally determine Y (or Y is
functionally dependent on X) if for every legal
relation instance r(R), for any two tuples t1 and t2
in r(R), we have
 t1[X] = t2[X] , then t1[Y] = t2[Y]

X  R denotes that X is a subset of the attributes
of R

X Y denotes that X functionally determines Y.
10.21


Functional Dependencies



Note that:
 if X is a candidate key then X  Y is correct
for any subset of attributes of R.
 if X  Y, this does not say whether Y  X is
correct or not.
Student(SSN, STNO, Name, Major)
 SSN  Name
 t1(91910, 980090012, Ahmed, …..)
 t2(91910, 980090012, Ahmed, …..)

STNO  Major
 t10((980090012,…, Math)
 t12(980090012, …, Math)


SSN  {SSN, Name, STNO, Major}
10.22


Functional Dependencies


Manager(SSN, Name, …. Dept)
 SSN  {Name, Dept}
(correct)
 SSN uniquely determines the Name and Dept

Dept  SSN
(correct)
 Dept uniquely determines the SSN

Name  SSN
(incorrect)
1
1
Manager

Manage
10.23
Dept



Functional Dependencies

FD is a property of the meaning or semantics of
attributes

FD is specified as a constraint on R, all
extensions of R (i.e. r(R)) should specify that
constraint (legal extensions); otherwise r(R) are
called illegal extensions

Note that we cannot infer FDs from r(R)
10.24




Diagrammatic Representation of FDs

SSN  STNO, NAME, MAJOR

STNO  SSN, NAME, MAJOR
Student(SSN, STNO, Name, Major)
FD 1
FD 2

10.25


Inference Rules for Functional
Dependencies


Let

F: set of functional dependencies defined on R
 F+ (Closure of F): is the set of all functional
dependencies that can be defined on R
 The closure of F is the set of all FDs that are
logically implied by F
The closure of F is denoted by F+
 F+ = { X  Y | F ╞ X  Y}
A BIG F+ may be derived from a small F
 For R(A, B, C) and F = {A  B, B  C}
 F+ = {A  B, B  C, A  C, A  A, B 
B,C  C, AB  AB, AB  A, AB  B, ... }



10.26


Inference Rules for Functional
Dependencies




Emp-Dept(SSN, Ename, Bdate, Address,
Dnumber, Dname, MGR-SSN)
F={
SSN  {Ename,Bdate,Address,Dnumber},
Dnumber  {Dname, MGR-SSN}}
We can infer the following FDs:
 SSN  SSN
(Reflexive)
 SSN  Ename
(Decomposition)
 SSN {Dname, MGR-SSN} (Transitive)

We usually write F ╞ X  Y to denote that the
FD X Y is inferred from F

X, Y are subsets of attributes
10.27


Functional Dependency



Several equivalent definitions:
 X  Y in R iff for any t1, t2 in r(R), if t1 and
t2 have the same X-value, then t1 and t2 also
have the same Y-value

X  Y in R iff there exist no t1, t2 in r(R)
such that t1 and t2 have the same X-value but
different Y-values

X  Y in R iff for each X-value, there
corresponds to a unique Y-value.
10.28



Functional Dependency


Question: if all X-values are different in all
possible r(R), does X  Y in R?

Theorem 1:
 If X is a superkey of R and Y is any subset of
R, then X  Y in R

Note that X  Y in R is a property that must be
true for all possible legal r(R), not just for the
present r(R)
10.29


Functional Dependency
Example:
A B
a1 b1
a1 b2
a2 b2
a2 b3
a3 b3

C
c1
c1
c2
c2
c2
D
d1
d2
d2
d3
d4

which is true?
AB
AC
CA
AD
BD
AB  D
AB  C
AB  CD
10.30



Identify Functional Dependency


FD created by assertions.
 Employees(SSN, Name, Years_of_emp,
Salary, Bonus)

Assertion:
 Employees hired the same year have the same
salary

This assertion implies:
 Years_of_emp  Salary
10.31


Inference Rules


IR1: Reflexive Rule (‫) منعكس‬
 Y  X, then X Y
 e.g.
 SSN  SSN
 {P#, S#, Qty}  Qty

IR2: Augmentation Rule (‫)زيادة‬
 { X  Y} ╞ XZ  YZ
 e.g.: F = {SSN  Address}
 F ╞ {SSN, Name}  {Address, Name}

10.32


Inference Rules


IR3: Transitive rule: (‫)متعدّي‬
 {X  Y, Y  Z} ╞ X  Z
 e.g.
 F = {SSN  Dnumber, Dnumber  Dname}
 F ╞ SSN  Dname

IR4: Decomposition Rule: (‫)التّحلّل‬
 X  {Y, Z} ╞ X  Y and ╞ X  Z
 e.g.
 F = {SSN  {Ename,BDate,Address, Dnumber}}





F ╞ SSN  Ename
╞ SSN  Bdate
╞ SSN  Address
╞ SSN  Dnumber
10.33


Inference Rules


IR5: Union (Additive) Rule:
 {X  Y, X  Z} ╞ X  {Y, Z}
 e.g.
 F = {SSN  Ename, SSN  Bdate}
 F ╞ SSN  {Ename, Bdate}

IR6: Pseudo-transitive Rule:
 {X  Y, WY  Z} ╞ WX  Z
 e.g.
 F = { SSN  STNO, {Major, STNO}  Name}
 F ╞ {Major, SSN}  Name

10.34



Closure of Attributes

How to determine if F ╞ X  Y is true?

Method 1:
 Compute F+
 If X Y  F+, then F ╞ X Y

Problem:
 Computing F+ could be very expensive!
10.35



Closure of Attributes

Method 2:

Compute X+ : the closure of X under F

X+ denotes the set of attributes that are
functionally determined by X under F.




X+ = { Y | X  Y  F+ }
Theorem:
 X Y  F+ if and only if Y  X+
10.36


Algorithm for Computing



X+

Input:
 a set of FDs F, a set of attributes X in R
Output:
 X+
Begin
 X+ = X;
 Repeat
 oldX+ = X+
 for each FD Y  Z in F do
– if Y  X+ then X+ = X+  Z;
until (X+ = oldX+ )
end



10.37


Algorithm for Computing




X+

Example:
 R(A, B, C, G, H, I) = ABCGHI
 X = AG
 F = {A  B, CG  HI, B  H, A  C }
Compute X+ = (AG)+
Initialization:
 X+ = AG;
10.38


Algorithm for Computing

X+

1st iteration:
 consider A  B:
 since A is a subset of X+, X+ = ABG;

consider CG  HI:
 since CG is not a subset of X+, X+ = ABG;

consider B  H:
 since B is a subset of X+, X+ = ABGH;

consider A C:
 since A is a subset of X+, X+ = ABCGH;


X+ is changed from AG to ABCGH
10.39


Algorithm for Computing

X+

2nd iteration:
 consider A  B:
 since A is a subset of X+, X+ = ABCGH;

consider CG  HI:
 since CG is a subset of X+, X+ = ABCGHI;

consider B  H:
 since B is a subset of X+, X+ = ABCGHI;

consider A  C:
 since A is a subset of X+, X+ = ABCGHI;


X+ is changed from ABCGH to ABCGHI
10.40


Algorithm for Computing
X+

3rd iteration:
 consider each FD in F again, but there is no
change to X+, exit

Result:
 (AG)+ = ABCGHI.




The performance of the algorithm is sensitive to
the order of FDs in F
10.41


Algorithm for Computing

X+

Theorem:
 Given R(A1, ..., An) and a set of FDs F in R,
K  R is a
 superkey if K+ = {A1, ..., An};
 candidate key if K is a superkey and for any
proper subset X of K, X+  {A1, ..., An}.

10.42


Algorithm for Computing

X+

Continue the above example:
 AG is a superkey of R since
 (AG)+ = ABCGHI.


Since A+ = ABCH, G+ = G, neither A nor G is
a superkey.

Hence, AG is a candidate key
10.43



Normal Forms Based on Primary Keys



Normalization of data:
 is a process during which unsatisfactory
relation schemas are decomposed by breaking
up their attributes into smaller relation
schemas that posses desirable properties
 We normalize data for several reasons one of
them is to avoid update anomalies
We have
 First Normal Form (1NF)
 Second Normal Form (2NF)
 Third Normal Form (3NF)
 Boyce-Codd Normal Form (BCNF) (a
stronger definition of 3NF)
 All the above normal forms are based on
functional dependencies.
10.44



Normal Forms Based on Primary Keys
Forth Normal Form (4NF) Based on
multivalued dependencies.
 Fifth Normal Form (5NF) based on join
dependencies.
Aside:
 Student-Adv(STNO, StName, Major,…, Ano,
Aname, …)


 Has several problems


Student(STNO, StName, Major, …, Ano)
Advisor(Ano, Aname, ….)
 Pay the price of expensive joins

10.45


Basic Definitions







R = {A1, …, An)
Superkey: is set of attributes S  R and t1[S] 
t2[S] i =1, ..., n. S may contain redundant
attributes.
Key (K): is a superkey with no redundant
attributes, i.e. removal of any attribute from K
will no longer make it a superkey.
Candidate Key: if a relation has more than one
key, each is called a candidate key. One of these
keys is arbitrarily chosen as a primary key.
Prime Attribute: is an attribute which is a member
of any key (primary or candidate); other attributes
are called nonprime attributes.
10.46


Basic Definitions


Student(SSN, STNO, Name, Address, Salary)
 Superkeys
 {SSN,Name}/{SSN,STNO,Name,Address,Salary
}

Candidate keys
 {SSN, STNO};

Primary Key
 SSN

Prime Attribute:
 SSN and STNO

Nonprime Attributes:
 {Name, Address, Salary}

10.47



1NF (First Normal Form)


A relation schema R is in 1NF if every attribute of
R takes only single and atomic values.

Domains of attributes must include only atomic
values and that the value of any attribute in a
tuple must be a single value from the domain of
that attribute.

In other words, multivalued and composite
attributes are disallowed.
10.48



1NF (First Normal Form)


Student(STNO, StName, {Course(CNO, Ctitle)}

The set braces {} identify the attribute Course as
multivalued

The set braces () identify the attribute Course as a
composite attribute

Because of the last attribute (Course), the Student
relation schema is not in 1NF.
10.49


1NF (First Normal Form)


Student(STNO, StName, {Course(CNO, Ctitle)}

To normalize it to 1NF:
 Student(STNO, StName, CNO, Ctitle)
 this is not a good representation because it has
many disadvantages. Replication of Course
information and student information. Combining
attributes which belong to two separate entities
(namely student and course) into a single relation
schema.

10.50


1NF (First Normal Form)


Student(STNO, StName, {Course(CNO, Ctitle)}


Student(STNO, StName, CNO)
Course(CNO, Ctitle)
 this design suffers from drawback that student
information has to be repeated in the student
relations several times (in fact a number of times
that is equal to the number of courses taken by
that student).

10.51


1NF (First Normal Form)


Student(STNO, StName, {Course(CNO, Ctitle)}



Student(STNO, StName)
Course(CNO, Ctitle)
Study(STNO, CNO)
 This is the best representation because we have
reduced the duplication.

10.52



Second Normal Form (2NF)


Prime attribute --- an attribute in any candidate
key.

Y is fully functionally dependent on X if X  Y
and no proper subset of X functionally determines
Y

FD X  Y is a fully functional dependency if
removal of any attribute A from X means that the
dependency does not hold any more. In other
words (X - {A}) does not determine Y

A FD X  Y is a partial FD if exist some
attribute A which belongs to X and (X - {A}) 
Y still holds
10.53


Second Normal Form (2NF)

General Definition of 2NF
 A relation schema R is in 2NF if every
nonprime attribute A in R is not partially
dependent on any key of R



(i.e. if every nonprime attribute is fully
functionally dependent on every key of R)
10.54


Second Normal Form (2NF)




Emp-Proj(SSN, Pnumber, Hours, Ename, Pname,
Plocation)
 FD1: {SSN, Pnumber}  Hours
(FD)
 FD2: SSN  Ename
(PD)
 FD3: Pnumber  {Pname, Plocation} (PD)
 Because of FD2 and FD3 Emp-Proj is not in
2NF
2NF Normalization
 Emp(SSN, Ename)
 Proj(Pnumber, Pname, Plocation)
 Work(SSN, Pnumber, Hours)
10.55



Second Normal Form (2NF)

Consider
 Bank-Loans (Bank_name, Assets,
Headquarter, Loan_no, Customer_name,
Amount),
 FD1: Bank_name  {Assets, Headquarter}
 FD2: {Bank_name, Loan_no} 
{Customer_name, Amount}
 Because if FD1, Bank-Loans is not in 2NF.

2NF Normalization
 Banks(Bank_name, Assets, Headquarter)
 Loans(Bank_name, Loan_no,
Customer_name, Amount)
10.56



Second Normal Form (2NF)


2NF is not good enough:
 A relation schema in 2NF can still have serious
redundancy problem as well as insertion and
deletion anomalies.

Consider Parts(Part_no, Name, Location, Unit_price,
Manu_id, Manu_name, Manu_Address)
 It is obvious that Parts is in 2NF

Redundancy and various anomalies are
introduced by
 Manu_id  {Manu_name, Manu_Address}

10.57


Second Normal Form (2NF)


Consider
 EMP_DEPT(SSN, EName, BDate, Address,
DNo, DName, DMGRSSN)

It is obvious that EMP_DEPT is in 2NF

Redundancy and various anomalies are
introduced by
 DNo  {DName, DMGRSNN}

10.58



Third Normal Form (3NF)


Transitive Dependency
 a FD X  Y in R is transitive if exist some set
of attributes Z that is not a subset of any key
of R and both X  Z and Z  Y hold.

A relation schema R is in 3NF if it is in 2NF and
no nonprime attribute A of R is transitively
dependent on a key of R
10.59


Third Normal Form (3NF)



A relation schema R is in 3NF if for every FD X
 A, where A is a single attribute, at least one of
the following is true:

(a) A  X

(b) A is a prime;

(c) X is a superkey
(Trivial dependency - Reflexive);
10.60



Third Normal Form (3NF)


R is not in 3NF if a non-prime non-trivially
depends on a non-superkey.

If R in 3NF, it should not have a nonkey attribute
functionally determined by another nonkey
attribute (or by a set of nonkey attributes)
10.61


Third Normal Form (3NF)


General Definition of 3NF:
 A relation schema R is in 3NF if, whenever a
nontrivial functional dependency X  A
holds, either
 (a) X is a superkey of R, or
 (b) A is a prime attribute of R

10.62


Third Normal Form (3NF)

Emp-Dept(SSN, Ename, Bdate, Address, Dnumber,
Dname, DMGR-SSN)
 FD1: SSN  {Ename, Bdate, Address, Dnumber,
Dname, DMGR-SSN}
 FD2: Dnumber  {Dname, DMGR-SSN}
 Emp-Dept is in 1NF, 2NF, but because of








SSN  Dnumber and
Dnumber  Dname and
Dname is a nonprime attribute, and
Dnumber is not a superkey,
Emp-Dept is not in 3NF.
To transform Emp-Dept into 3NF:
 Emp(Enam, SSN, Bdate, Address, Dnumber)
 Dept(Dnumber, Dname, DMG-SSN)
10.63


Third Normal Form (3NF)

Employees (SSN, Name, Age, Salary,
Dept_name, Dept_manager_SSN).

Employees is in 2NF

 since SSN is the only candidate key and every
attribute is fully dependent on it.

Employees is not in 3NF because
 Dept_name  Dept_manager_SSN

10.64


Third Normal Form (3NF)




LOTS(Property-ID#, County-Name, Lot#, Area, Price,
Tax-Rate)
 FD1: Property-ID#  {County-Name, Lot#, Area,
Price, Tax-Rate}
 FD2: {County-Name, Lot#}  {Property-ID#, Area,
Price, Tax-Rate}
 FD3: County-Name  Tax-Rate
 FD4: Area  Price
2 candidate keys: {Property-ID#},{County-Name, Lot#}
LOTS is not in 2NF, because of County-Name  TaxRate
 Tax-Rate is partially dependent on the candidate key
{County-Name, Lot#}.


2NF:
 LOTS1(Property-ID#, County-Name, LOT#, Area,
Price)
 LOTS2(County-Name, Tax-Rate)
10.65


Third Normal Form (3NF)

The relation LOTS1 is not in 3NF, because of
Area  Price
 Area is not a superkey and Price is not prime
attribute
3NF:
 LOTS1A(Property-ID#, County-Name, Lot#,
Area)
 LOTS1B(Area, Price)
 LOT2(County-Name, Tax-Rate)

The above relation schemas are in 3NF



10.66



Boyce-Codd Normal Form (BCNF)

Assume that we have thousands of lots, but twocounties:
 Marion county and Liberty county.
 Lot areas in Marion county are
 .5, .6, .7, .8, .9 and 1 acres

Lot areas in Liberty county are
 1.1,1.2, …, 1.9, 2.0 acres



In this case we have : AREA  County-Name
LOTS1A(Property-ID#, County-Name, Lot#,
Area)
 FD5: Area  County-Name
 Still in 3NF , since County-Name is a prime
attribute
10.67



Boyce-Codd Normal Form (BCNF)


A relation schema R is in BCNF if whenever a
FD X  A holds in R, then X is a superkey of R.
R is in BCNF if for every non-trivial FD, the left
side is a superkey.



LOTS1A-X(Property-ID#, Area, Lot#)
LOTS1A-Y(Area, County-Name)

To describe a relation schema R as “good” it
should be at least in 3NF (general) or BCNF


If R is in BCNF, then R is also in 3NF
However, R in 3NF
R in BCNF
10.68


Normal Forms: Summary


Relationships between different NFs.
1NF
2NF
3NF
BCNF

10.69


Normal Forms: Summary


1NF:
 Attributes should be single-valued and have
atomic domain
 Normalize into 1NF:
 Form a new relations for each non-atomic attribute


2NF:
 2NF removes some insertion anomalies and
deletion anomalies.
 2NF removes some redundancies, namely,
redundancies caused by partial dependencies
on key.
10.70



Normal Forms: Summary


3NF:
 3NF removes all insertion anomalies and
deletion anomalies.
 3NF also removes some redundancies caused
by transitive dependencies.

BCNF:
 achieves all achieved by 3NF.
 BCNF removes all redundancies caused by
FDs.
10.71
