IT 244 Database Management System

Download Report

Transcript IT 244 Database Management System

IT 244
Database Management System
Topic 3 : Normalization
Ref : -
A First Course in Database System
(Jeffrey D Ullman, Jennifer Widom) + online.
Different Technique
• Thus far in this course we have discussed Entity
Relationship Modeling as a technique for
designing a database
• There exists another important design
technique, called Normalisation which we shall
now cover .
• Before we discuss the details, let us first
consider the applicability of each of these
techniques
Entity Relationship Modeling
• Entity Relationship Modeling is appropriate when
a database is to be built from scratch
• This arises when a completely new information
system is being built (e.g. a company realizes
that they need to keep track all purchase made
from its suppliers in order to improve efficiency)
• In the above scenario no relations exist (or can
be identified directly), thus we need to model the
requirements in the form of an ER model, and
then convert the latter to the Relational Model
Normalisation
• However, there exists another scenario whereby a
company decides to migrate from a file-based
information system to a database system
• It is in this latter scenario that Normalisation is most
applicable.
• In this scenario it is possible to define relations
directly (i.e. without going through the intermediate
ER modeling stage) by analysing the report
structures needed by the system
Normalisation
• Database normalization is a design technique by which
relational database tables are structured in such a way
as to make them invulnerable to certain types of logical
inconsistencies and anomalies (Wikipedia Free Enc.)
• Normalisation is the technique by which we
decompose unsatisfactory relations into smaller
relations, each of which are "better" in some
sense than the original
• Is the step by step process of organizing data in
a way that eliminates potential problem in
updating and accessing data
• Normalisation can involve up to 5 steps - each
step gives rise to a Normal Form
Normal Form
• Relations can be in:
- 0th Normal Form,
- 1st Normal Form,
- 2nd Normal Form,
- 3rd Normal Form
- and 4th Normal Form
or 5th Normal Form
Which NF better then the other…?
…. A table in third normal form is less open to logical
inconsistencies and anomalies than a table that is only in
second normal form …
0th Normal Form
• Consider relation ZERO
ZERO(supp_no, city, supp_status, (pro_no,qty))
• Above relation is in 0th Normal Form
• A relation in which one or more of its attributes are
non-atomic is said to be in 0th Normal Form
• Basic problem with 0th Normal Form is that it cannot
be represented in a relational database
• Need to convert relation ZERO to 1st Normal Form
1st Normal Form
• FIRST (supp_no, pro_no, city, supp_status, qty)
• Relation FIRST is in 1st Normal Form
• A relation in which all of its attributes are atomic is
said to be in 1st Normal Form
• The first NF addresses the occurrence of repeating
groups in an entity… an entity is said to be in 1NF if
it contains no repeating group.
• K, A, B, C, D, E, F, R1, R2, R3
• K, A, B, C, D, E, F
• Even though FIRST can be represented in a
relational database it suffers from the following
problems
1st Normal Form
• Problems with Relation FIRST
• Insert : We cannot insert a supplier until that
supplier supplies at least one part
• Delete : If we delete the only tuple for a
particular shipment for a supplier, we
also lose supplier information in the
process
• Modify : the city value for a given supplier appears
many times causing update problems
• Solution to these problems is to further normalize - into
2nd Normal Form
Progressing to the 1NF
1. Place all attributes which appear in a
repeating group into a new table;
2. Duplicate in the new table the primary
key of the table from which the repeating
group was extracted;
3. Designate a primary key for each new
table produced. This key will be the
primary key of the original table
concatenated with one or more attributes
in the new table.
2nd Normal Form
• A relation is said to be in 2nd Normal Form if it
has been placed in first NF and if every
data item in the entity is fully functionally
dependent (FD’s) on the primary key of the
entity.
• K1 + K2, A, B, C, D, E, F (F depends only on K2)
• K1 + K2, A, B, C, D, E
• A non-key attribute is any attribute that does
not participate in the primary key
Progressing to the 2NF
1. If a data is fully functionally dependent on only
a part of the primary key, move that data item
and the part of the primary key on which it is
fully functionally dependent to new table;
2. If other data items are functionally dependent
on the same part of the key, place them in the
new table
3. Make the partial primary key value copied from
the original table of the primary key of the new
table
4. Examine each new table to ensure that each
data item is fully functionally dependent on the
primary key, that is that they are in the 2NF
Functional Dependence
• Given a Relation R, an attribute Y of R is
functionally dependent on attribute X if and
only if each X value is associated with only
one Y value
• Y is fully functionally dependent on X if it is
functionally dependent on X and not
functionally dependent on any part of X
Functional Dependence
status
Supp_no
City
Qty
Pro_no
Above diagram is called the Functional Dependency Diagram (FDD)
2nd Normal Form
• From the definition of 2nd NF we get:
SECOND (supp_no, status, city)
SP(supp_no, pro_no, qty)
• Need to verify that these 2 relations
overcome all the problems associated with
relation FIRST
• However, the SECOND/SP structure still
has problems associated with it
2nd Normal Form
• Problems
• Insert - Cannot express the fact that a particular
city has a particular status value until at
least one supplier is located in that city
• Delete - If we delete the only supplier for a
particular city, then we not only lose that
supplier information but we also lose city
info:
• Update - The status value for a given city appears
many times, causing the usual update
problems
• Thus need to normalize further into 3NF
3rd Normal Form
• A relation R is in 3rd NF if and only if it is in 2NF,
and every non-key attribute is non-transitively
dependent on the primary key attribute
• En entity is said to be in 3NF if has been placed
in 2NF and if all fields which are not part of the
primary key are mutually independent , that is
there are no transitive dependencies (where a
field is dependent not only on the key but also
on another field)
• K, A, B, C, D, E (E is also dependent on A)
• K, A, B, C, D
Progressing to 3NF
1. Identify all fields involved in transitive
dependencies;
2. Move all items involved in transitive
dependencies to a new entity
3. Identify a primary key for the new entity
4. Place the primary key for the new entity
as a foreign key on the original entity.
There should be no remaining transitive
dependencies on the original entity.
Transitive Dependence
• If X -> Y and Y -> Z is said to transitively
dependent on X
• This example illustrates that we should not
blindly use the Normalization technique alone,
but should augment it with our real-world
knowledge of the system that we are designing
Boyce-Codd Normal Form
• An entity is said to be in BCNF if it is in 3NF and
every data item on which some other item is fully
functionally dependent is a candidate key.
• 3rd NF does not deal satisfactorily when a relation
has multiple candidate keys. A candidate key is
any key that can act as a primary key.
• A relation is in BCNF if and only if every
determinant is a candidate key. A determinant is
any attribute on which some other attribute is fully
functionally dependent on.
BCNF Example
• Entity X (A, B, C, D, E, F, G, H)
• {A B} and {B C} are both Candidate
primary key.
• Entity X (A + B, D, F, G)
• Entity Y (B + C, E, H)
BCNF Example
• Consider this Relation SSP
Sup_no
sname
Pro_no
Qty
SI
Smith
PI
300
Sl
Smith
P2
200
SI
Smith.
P3
400
SI
Smith
P4
200
Update problems with SSP, even though SSP is in 3rd NF, there are two
candidate keys for SSP:
(sup_no, pro_no) and (sname, pro_no)
Here sname is a determinant, but not a candidate key
Hence relation SSP is not in BCNF SSP can be decomposed into:
SS(sup_no, sname)
SP(sname, pro_no, qty)
Both SS and SP are in BCNF Can be verified that problems with SSP have
now disappeared
Higher Normal Form
• Higher Normal Forms such as 4th NF and
5th NF do exist, but they are mainly of
theoretical interest only
• This is because more than 99 % of relations
which are in 3rd NF or BCNF are also in 4th
NF and 5th NF
Choosing Between ER Model and
Normalisation
• ER modeling is appropriate when the applications being
developed is new
• In this case normalization should be applied once the ER
model is transformed to the Relational Model
• On the other hand the pure normalization approach is
appropriate when the application being developed already
exists
Eg. A company wants to migrate its S/W from a file-based system
to a database system
In this case the database could be designed by applying
normalization techniques to the reports produced by the old
(file_based) system.
Modeling Entities from Relations
• It is possible to construct an ER model form the
relations produced by the normalization process
• The ER model is drawn in three steps
Step 1
For each relation, draw a rectangle to represent the
equivalent entity type, using the relation name.
Step 2
For each pair of entity types, compare the primary keys. If
the key of one is fully contained within the other, draw a
relationship line between them
Step 3
For each pair of entity types, if the key of one is a foreign
key of another draw a relationship line between the entity
types
Example
Customer (Cust#, Name, Address)
Order (Order#, Date, Total, Cust#)
Order Item (Order#, Item#, Quantity)
Stock Item (Item#, Name, Unit Price, Qty_on_hand)
Statement Item (Cust#, Date, Line#, PayType.
PayAmt)
Summary
•
•
•
•
•
•
•
•
Different Technique
Normal Forms (0th – 5th)
Functional Dependences
Transitive Dependence
Boyce-Codd Normal Form
Higher Normal Form
Choosing Between ER Model and Normalisation
Modeling Entities from Relations
Anomalies
Boyce-Codd Normal Form
3rd Normal Form
Anomalies
• Goal of relational schema design is to
avoid anomalies and redundancy.
– Update anomaly : one occurrence of a fact is
changed, but not all occurrences.- should
necessary only one changed.
– Deletion anomaly : valid fact is lost when a
tuple is deleted.- deletion of a record might
delete data that is not intended to be deleted
– Insertion anomaly: Adding a new record into a
relation would require entering another record
to another relation
Example of Bad Design
Drinkers(name, addr, beersLiked, manf, favBeer)
name
Vili
Vili
Tolu
addr
Kalau
???
Manima
beersLiked
VB
Royal
VB
manf
A.B.
R.B
???
favBeer
Ikale
???
VB
Data is redundant, because each of the ???’s can be figured
out by using the FD’s name -> addr favBeer and
beersLiked -> manf.
This Bad Design Also
Exhibits Anomalies
name
Vili
Vili
Tolu
addr
Kalau
Kalau
Manima
beersLiked
VB
Royal
VB
manf
A.B.
R.B
A.B.
favBeer
Ikale
Ikale
VB
• Update anomaly: if Vili is transferred to Fanga,
will we remember to change each of his tuples?
• Deletion anomaly: If nobody likes VB, we lose track
of the fact that A.B manufactures VB.
• Insertion anomaly: If entering of records under beersLiked
require entering of records under manf.
• END of SECTION
Boyce-Codd Normal Form
• We say a relation R is in BCNF if
whenever X ->A is a nontrivial FD that
holds in R, X is a superkey.
– Remember: nontrivial means A is not a
member of set X.
– Remember, a superkey is any superset of a
key (not necessarily a proper superset).
Example
• Drinkers(name, addr, beersLiked, manf,
favBeer)
• FD’s: name->addr favBeer, beersLiked->manf
- Only key is {name, beersLiked}.
- In each FD, the left side is not a superkey.
- Any one of these FD’s shows Drinkers is not
in BCNF
Another Example
•
•
•
•
Beers(name, manf, manfAddr)
FD’s: name->manf, manf->manfAddr
Only key is {name}.
name->manf does not violate BCNF, but
manf->manfAddr does.
Decomposition into BCNF
• Given: relation R with FD’s F.
• Look among the given FD’s for a BCNF
violation X ->B.
– If any FD following from F violates BCNF,
then there will surely be an FD in F itself
that violates BCNF.
• Compute X +.
– Not all attributes, or else X is a superkey.
Decompose R Using X -> B
•
Replace R by relations with schemas:
1.
2.

R1 = X +.
R2 = (R – X +) U X.
Project given FD’s F onto the two new
relations.
1. Compute the closure of F = all nontrivial FD’s that
follow from F.
2. Use only those FD’s whose attributes are all in R1 or
all in R2.
Decomposition Picture
R1
R-X +
X +-X
X
R2
R
Example
•
•
•
•
•
Drinkers(name, addr, beersLiked, manf, favBeer)
F = name->addr, name -> favBeer,
beersLiked->manf
Pick BCNF violation name->addr.
Close the left side: {name}+ = {name, addr, favBeer}.
Decomposed relations:
1. Drinkers1(name, addr, favBeer)
2. Drinkers2(name, beersLiked, manf)
Example, Continued
• We are not done; we need to check Drinkers1
and Drinkers2 for BCNF.
• Projecting FD’s is complex in general, easy
here.
• For Drinkers1(name, addr, favBeer), relevant
FD’s are name->addr and name->favBeer.
– Thus, name is the only key and Drinkers1 is in
BCNF.
Example, Continued
•
For Drinkers2(name, beersLiked, manf),
the only FD is beersLiked->manf, and
the only key is {name, beersLiked}.
– Violation of BCNF.
•
beersLiked+ = {beersLiked, manf}, so we
decompose Drinkers2 into:
1. Drinkers3(beersLiked, manf)
2. Drinkers4(name, beersLiked)
Example, Concluded
•
The resulting decomposition of Drinkers :
1. Drinkers1(name, addr, favBeer)
2. Drinkers3(beersLiked, manf)
3. Drinkers4(name, beersLiked)

Notice: Drinkers1 tells us about drinkers,
Drinkers3 tells us about beers, and Drinkers4
tells us the relationship between drinkers and
the beers they like.
Third Normal Form - Motivation
• There is one structure of FD’s that
causes trouble when we decompose.
• AB ->C and C ->B.
– Example: A = street address, B = city,
= zip code.
• There are two keys, {A,B } and {A,C }.
• C ->B is a BCNF violation, so we must
decompose into AC, BC.
C
We Cannot Enforce FD’s
• The problem is that if we use AC and BC
as our database schema, we cannot
enforce the FD AB ->C by checking FD’s
in these decomposed relations.
• Example with A = street, B = city, and C =
zip on the next slide.
An Unenforceable FD
street
zip
545 Tech Sq. 02138
545 Tech Sq. 02139
city
Cambridge
Cambridge
zip
02138
02139
Join tuples with equal zip codes.
street
city
545 Tech Sq. Cambridge
545 Tech Sq. Cambridge
zip
02138
02139
Although no FD’s were violated in the decomposed relations,
FD street city -> zip is violated by the database as a whole.
3NF Let’s Us Avoid This Problem
• 3rd Normal Form (3NF) modifies the
BCNF condition so we do not have to
decompose in this problem situation.
• An attribute is prime if it is a member of
any key.
• X ->A violates 3NF if and only if X is not
a superkey, and also A is not prime.
Example
• In our problem situation with FD’s
AB ->C and C ->B, we have keys AB and
AC.
• Thus A, B, and C are each prime.
• Although C ->B violates BCNF, it does not
violate 3NF.
What 3NF and BCNF Give You
•
There are two important properties of a
decomposition:
1. Recovery : it should be possible to project
the original relations onto the decomposed
schema, and then reconstruct the original.
2. Dependency preservation : it should be
possible to check in the projected relations
whether all the given FD’s are satisfied.
3NF and BCNF, Continued
• We can get (1) with a BCNF decompsition.
– Explanation needs to wait for relational
algebra.
• We can get both (1) and (2) with a 3NF
decomposition.
• But we can’t always get (1) and (2) with a
BCNF decomposition.
– street-city-zip is an example.
Summary
•
•
•
•
•
•
•
•
•
•
Anomalies
Example of Bad Design
This Bad Design Also Exhibits Anomalies
Boyce-Codd Normal Form
Decomposition into BCNF
Decompose R Using X -> B
Decomposition Picture
Third Normal Form – Motivation
We Cannot Enforce FD’s
An Unenforceable FD