04CM20145Lecture8 - Department of Computer Science
Download
Report
Transcript 04CM20145Lecture8 - Department of Computer Science
Dr Alwyn Barry
Dr Joanna Bryson
CM20145
Further DB Design –
Normalization
Last Time
Database design is an ongoing,
iterative process.
Requirements come from data, user
demands, design issues.
Change occurs:
Corporations & technologies grow.
Programmers & users learn.
Views / security.
Lossless-join decomposition
Now: Science for improving design.
Design Process & Normalization
We assume a schema R is given.
R could have been generated when
converting E-R diagram to a set of tables.
R could have been a single relation
containing all attributes that are of interest
(called universal relation).
Normalization breaks R into smaller
relations.
R could be the result of any ad hoc
design of relations, which we then test
& convert to normal form.
Overview
First Normal Form.
Functional Dependencies.
Second Normal Form.
Third Normal Form.
Boyce-Codd Normal Form.
Fourth Normal Form.
Fifth Normal Form.
Domain Key / Normal Form.
Design Process & Problems.
First Normal Form – 1NF
You aren’t supposed to have more
than one value per attribute of a tuple.
All tuples have the same number of
attributes.
Necessary for a relational database.
Name
Office
Office Hours
Barry
2.23
1pm, 4pm
Bryson
L2.27
BAD
11am, 5pm
Getting Caught Out With 1NF
A domain is atomic if its elements are
considered to be indivisible units.
Examples of non-atomic domains:
Set-valued attributes, composite attributes.
Identifiers like CS101 that can be broken up
into parts.
A relational schema R is in first normal
form if the domains of all attributes of
R are atomic.
Non-atomic values:
complicate storage,
encourage redundancy,
Depend on interpretation built into
application programs.
Are You Atomic?
Atomicity is not an intrinsic property of
the elements of the domain.
Atomicity is a property of how the
elements of the domain are used.
E.g. strings containing a possible delimiter
(here: a space)
cities = “Melbourne Sydney”
(non-atomic: space separated list)
surname = “Fortescue Smythe”
(atomic: compound surname)
E.g. strings encoding two separate fields
bucs_login = cssjjb
If the first two characters are extracted to find the
department, the domain bucs_login is not atomic.
Leads to encoding of information in application
program rather than in the database.
Second Normal Form (2NF)
Violated when a nonkey column is a fact
about part of the primary key.
A column is not fully functionally dependent
on the full primary key.
CUSTOMER-CREDIT in this case:
From Watson
ORDER
ITEMID
CUSTOMERID
QUANTITY
CUSTOMER-CREDIT
Desk
JJB
25
OK
Chair
AMB
3
POOR
ITEM
ORDER
*itemid
…
quantity
…
CUSTOMER
*customerid
customer-credit
…
BAD
FIX
Def: Functional Dependency
Let R be a relation schema
R and R
The functional dependency (FD) holds on R
(“ is FD on ”) iff for any legal relations r(R):
whenever any two tuples t1 and t2 of r agree on
the attributes
they also agree on the attributes .
i.e. (t1) = (t2) (t1) = (t2)
Example: Consider r(A,B) with the following
instance of r:
A: Initials
B: Chore
JJB
Grading
AMB
Setting Tutorials
JJB
Writing Unit Reviews
A B does NOT hold, but B A does hold
Functional Dependencies: Uses
Way to encode “business rules”.
Specify constraints on the set of
legal relations.
We say that F holds on R if all legal
relations on R satisfy the set of FDs
F.
Test relations to see if they are
legal under a given set of FDs.
If a relation r is legal under a set F
of FDs, we say that r satisfies F.
Functional Dependencies
An FD is an assertion about a schema,
not an instance.
If we only consider an instance or a
few instances, we can’t tell if an FD
holds.
Inspecting only a few bird species (e.g.
crows, cardinals and canaries) we might
conclude colour species.
However, this would be a bad FD as we
would find out if we found some ravens.
Thus, identifying FDs is part of the
data modelling process.
Trivial Functional Dependencies
An FD is trivial if it is satisfied by
all instances of a relation
E.g.
customer-name, loan-number
customer-name
customer-name customer-name
In general, is trivial if
Permitting such FDs makes
certain definitions and algorithms
easier to state.
Functional Dependency vs Key
FDs can express the same constraints
we could express using keys:
Superkeys:
K is a superkey for relation schema R if
and only if K R
Candidate keys:
K is a candidate key for R if and only if
K R, and
there is no K’ K such that K’ R
Of course, which candidate key
becomes the primary key is arbitrary.
FDs <> Keys
FDs can represent more information than
keys can on their own.
Consider the following Loan-info-schema:
Loan-info-schema = (customer-name, loan-number,
branch-name, amount).
We expect these FDs to hold:
loan-number amount
loan-number branch-name
We could try to express this by making loannumber the key, however the following FD
does not hold:
loan-number customer-name
Incidentally, this isn’t a very good table!
(¬2NF)
FD Closure
Given a set F of FDs, other FDs are logically
implied.
E.g. If A B and B C, we can infer that A C
The set of all FDs implied by F is the closure of
F, written F+ .
Find F+ by applying Armstrong’s Axioms:
if , then
if , then
if , and , then
(reflexivity)
(augmentation)
(transitivity)
Additional rules (derivable from Armstrong’s
Axioms):
If and holds, then holds (union)
If holds, then holds and holds
(decomposition)
If holds and holds, then holds
(pseudotransitivity)
Bad Decomposition Example
(From Last Time)
A Non Lossless-Join Decomposition
R = (A, B) R1 = (A), R2 = (B)
A B
A
B
A
B
1
2
A(r)
B(r)
1
2
1
2
1
2
1
r
A (r) ⋈ B (r)
Thus, r is different to A (r) ⋈ B (r)
So A,B is not a lossless-join
decomposition of R.
FDs & Lossless Decomposition
All attributes of an original schema (R)
must appear in the decomposition (R1,
R2):
R = R 1 R2
Lossless-join decomposition.
For all possible relations r on schema R
r = R1 (r) ⋈ R2 (r)
A decomposition of R into R1 and R2 is
lossless-join if and only if at least one
of the following dependencies is in F+:
R1 R2 R1
R1 R2 R2
Second Normal Form (2NF)
Violated when a nonkey column is a fact
about part of the primary key.
A column is not fully functionally dependent
on the full primary key.
CUSTOMER-CREDIT in this case:
From Watson
ORDER
ITEMID
CUSTOMERID
QUANTITY
CUSTOMER-CREDIT
Desk
JJB
25
OK
Chair
AMB
3
POOR
ITEM
ORDER
*itemid
…
quantity
…
CUSTOMER
*customerid
customer-credit
…
BAD
FIX
Third Normal Form (3NF)
Violated when a nonkey column is a
fact about another nonkey column.
A column is not fully functionally
dependent on the primary key.
R is 3NF iff R is 2NF and has no
transitive dependencies.
EXCHANGE RATE violates this.
FIX
STOCK
STOCK CODE
NATION
GOOG
USA
NOK
FIN
EXCHANGE RATE
0.67
BAD
0.46
NATION
STOCK
*nation code
nation name
exchange rate
*stock code
firm name
stock price
stock quantity
stock dividend
stock PE
Boyce-Codd (BCNF)
Arises when a table:
has multiple candidate keys,
the candidate keys are composite,
the candidate keys overlap.
R is BCNF iff every determinant is a cand. key.
E.g. Assume one consultant per problem per client,
and one problem per consultant.
If client-problem is the primary key, how do you add
a new consultant?
Like 3NF but now worry about all fields.
ADVISOR
BAD
CLIENT
PROBLEM
CONSULTANT
Alpha
Marketing
Gomez
Alpha
Production
Raginiski
CLIEN T
CLIEN T-
P ROBLEM
P ROBLEM
*clientno
…
FIX
* problemcode
*cltp ro bdate
…
CON SULTA NT
* consultid
…
…
Design Goals & their discontents
Goals for a relational database design:
eliminate redundancies by decomposing
relations,
must be able to recover original data using
lossless joins,
prefer not to loose dependencies.
BCNF:
no redundancies,
no guarantee of dependency preservation.
3NF:
dependency preservation,
but possible redundancies.
Fourth normal form (4NF)
A row should not contain two or more
independent multivalued facts.
4NF iff BCNF & no non-trivial multivalued dependencies.
Multivalued dependency means the value of
one attributed determines a set of values for
another.
STUDENT
STUDENTID
BAD
STUDENTSPORT
proficiency
…
SPORT
SUBJECT
…
50
Football
English
…
50
Football
Music
…
50
Tennis
Botany
…
SPORT
50
Karate
Botany
…
*sportcode
…
STUDENT
ENROLMENT
*studentid
…
*sectionno
…
FIX
SUBJECT
*subjectid
…
Fifth normal form (5NF)
5NF iff a relation has no join
dependency.
The schemas R1, R2,.., Rn have a
join dependency over R if they
define a lossless-join
decomposition over R.
This is way too complicated, don’t
worry about it.
Domain Key Normal Form
Every constraint on the relation
must be a logical consequence of
the domain constraints and the
key constraints that apply to the
relation.
Key: unique identifier.
Constraint: rule governing attribute
values.
Domain: set of values of the same
data type.
No known algorithm gives DK/NF.
E-R Model and Normalization
When an E-R diagram is carefully
designed, identifying all entities
correctly, the tables generated should
not need further normalization.
However, in a real (imperfect) design
there can be FDs from non-key
attributes of an entity to other attributes
of the entity.
The keys identified in E-R diagrams
might not be minimal - FDs can help us
to identify minimal keys.
FDs from non-key attributes of a
relationship set are possible, but rare.
Denormalization & Performance
May want to use non-normalized
schema for performance.
E.g. displaying customer-name along with
account-number and balance requires join
of account with depositor.
Alternative 1: Use denormalized relation
containing attributes of account as well as
depositor.
faster lookup.
extra space and extra execution time for updates.
extra coding work for programmer and possibility
of error in extra code.
Alternative 2: use a materialized view
defined as account ⋈ depositor
as above, except less extra coding, errors.
Limits of Normalization
Examples of bad database design, not
caught by normalization.
Good:
earnings(company-id, year, amount)
Bad:
earnings-2000, earnings-2001, earnings2002, etc., on (company-id, earnings)
all are BCNF, but querying across years difficult
needs a new table each year
company-year(company-id, earnings2000,earnings-2001, earnings-2002)
in BCNF, but querying across years difficult
requires new attribute each year
Summary 1 – Rules to Watch
1NF: attributes not atomic.
2NF: non-key attribute FD on part of
key.
3NF: one non-key attribute FD on
another.
Boyce-Codd NF: overlapping but
otherwise independent candidate
keys.
4NF: multiple, independent multivalued attributes.
5NF: join dependency.
Domain Key / NF: all constraints either
domain or key
Summary 2 – Concepts
Functional Dependencies:
Axioms & Closure.
Lossless-join decomposition.
Design Process.
Normalization Problems.
Next: Interfaces and Architectures
Reading & Exercises
Reading
Connolly & Begg Chapter (13, 14)
Silberschatz Chapters 7.
Any other book, the
design/normalization chapter.
Exercises:
Silberschatz
7.1, 7.2, 7.16, 7.23, 7.24, 7.27-29
Next Week
•
Architectures and Implementations
•
Integrity and Security
Slides after and including
this one you are not
responsible for, but I am
saving in case I decide to
use them in the future.
Goal: Formalize “Good Design”
Process:
Decide whether a particular relation R is in
“good” form.
In the case that a relation R is not in
“good” form, decompose it into a set of
relations {R1, R2, ..., Rn} such that:
each relation is in good form,
the decomposition is a lossless-join
decomposition.
Theory:
Constraints on the set of legal relations.
Require that the value for a certain set of
attributes determines uniquely the value
for another set of attributes –
functional dependencies.