Database Overview

Download Report

Transcript Database Overview

Database Overview
• File Management vs Database Management
• Advantages of Database systems: storage persistence,
programming interface, transaction management
• Three level Data Model
• DBMS Architecture
• Database System Components
• Users classification
File Management System Problems
•
•
•
•
•
•
•
Data redundancy
Data Access: New request-new program
Data is not isolated from the access implementation
Format incompatible data
Concurrent program execution on the same file
Difficulties with security enforcement
Integrity issues
Advantages of Databases
• Persistent Storage – Database not only provides persistent
storage but also efficient access to large amounts of data
• Programming Interface – Database allows users to access
and modify data using powerful query language. It
provides flexibility in data management
• Transaction Management – Database supports a
concurrent access to the data
Three Aspects to Studying DBMS's
1. Modeling and design of databases.
– Allows exploration of issues before committing to an
implementation.
2. Programming: queries and DB operations like update.
– SQL = “intergalactic dataspeak.”
3. DBMS implementation.
.
Definitions
• A database is a collection of stored operational data used
by various applications and/or users by some particular
enterprise or by a set of outside authorized applications and
authorized users
• A DataBase Management System (DBMS) is a software
system that manages execution of users applications to
access and modify database data so that the data security,
data integrity, and data reliability is guaranteed for each
application and each application is written with an
assumption that it is the only application active in the
database.
Three Level Data View –
Data Abstractions
View1
. . .
Conceptual
View
Of Data
Phyisal
Data
Storage
.
.
View k
DBMS Architecture
Logical and Physical Database Components
• Data Definition Language (DDL)
•
•
•
•
•
Data Manipulation Language (DML)
Host Language Interface
Data Administrator
Users
Query Processor
– Compiler
– Optimizer
• Management
– Transaction Manager
– File Manager
– Buffer Manager
– Authorization and Integrity Manager
Logical
Physical
Database Languages
Department
Faculty
Name
Dept
Dept
Chair
SQL
SELECT Chair
FROM Faculty, Department
WHERE Faculty.name = “Ken Noname”
AND Faculty.Dept = Department.Dept
Data definition language (DDL) ~ like type definitions in C or C++
Data Manipulation Language (DML)
Query (SELECT)
UPDATE < relation name >
SET <attribute> = < new-value>
WHERE <condition>
Logical Data Models
• A collection of tools for describing
– data
– data relationships
– data semantics
– data constraints
• Value based models: ER Model, OO Model
• Record Based Models: Relational Model
Entity-Relationship Model
• The enterprise data can be described as a set of entities and
a set of relationships between them.
• Entity – a data that pertains to, or describes some
component of the enterprise
• Each entity is characterized by a set of attributes
• Relationship – describes an interconnection between
different entities
• Entity Set – a set of entities that are characterized by the
same entity definition
• Relationship Set – a set of relationships of the same type
Entity-Relationship Model
Example of schema in the entity-relationship model
Relational Model
• An enterprise is represented as a set of relations
• Domain – is a set of atomic values. Each domain has a
NULL value.
• Data type – Description of a form that domain values can
be represented.
• Relation is a subset of a cartesian product of one or more
domains
• The elements of relations are called tuples. Each element
in the cartesian product is called attribute.
Relational model is good for:
Large amounts of data —> simple operations
Navigate among small number of relations
Difficult Applications for relational model:
• VLSI Design (CAD in general)
• CASE
• Graphical Data
ALU
ADDER
A
FA
CPU
Adder
ALU
ADDER
Bill of Materials or
transitive closure
Attributes
Relational Model
Student-id
Name
192-83-7465
Johnson
019-28-3746
Smith
192-83-7465
Johnson
321-12-3123
Jones
019-28-3746
Smith
Street
City
gpa
• Example of tabular data in the relational model
Alma
Palo Alto
3.6
North
Rye
2.7
Alma
Palo Alto
3.2
Main
Harrison
4.0
North
Rye
3.45
Relational Algebra
Lecture 2
Relational Model
•
•
•
•
•
•
•
•
Basic Notions
Fundamental Relational Algebra Operations
Additional Relational Algebra Operations
Extended Relational Algebra Operations
Null Values
Modification of the Database
Views
Bags and Bag operations
Basic Structure
• Formally, given sets D1, D2, …. Dn a relation r is a subset of
D1 x D2 x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai  Di
• Example:
customer_name = {Jones, Smith, Curry, Lindsay}
customer_street = {Main, North, Park}
customer_city
= {Harrison, Rye, Pittsfield}
Then r = { (Jones, Main, Harrison),
(Smith, North, Rye),
(Curry, North, Rye),
(Lindsay, Park, Pittsfield) }
is a relation over
customer_name , customer_street, customer_city
Relation Schema
• A1, A2, …, An are attributes
• R = (A1, A2, …, An ) is a relation schema
Example:
Customer_schema = (customer_name, customer_street,
customer_city)
• r(R) is a relation on the relation schema R
Example:
customer (Customer_schema)
Database
• A database consists of multiple relations
• Information about an enterprise is broken up into parts,
with each relation storing one part of the information
account : stores information about accounts
depositor : stores information about which customer
owns which account
customer : stores information about customers
• Storing all information as a single relation such as
bank(account_number, balance, customer_name, ..)
results in repetition of information (e.g., two customers
own an account) and the need for null values (e.g.,
represent a customer without an account)
Keys
• Let K  R
• K is a superkey of R if values for K are sufficient to identify
a unique tuple of each possible relation r(R)
– by “possible r ” we mean a relation r that could exist in
the enterprise we are modeling.
– Example: {customer_name, customer_street} and
{customer_name}
are both superkeys of Customer, if no two customers can
possibly have the same name.
• K is a candidate key if K is minimal
Example: {customer_name} is a candidate key for.
• Primary Key
Select Operation – Example
• Relation r
• A=B ^ D > 5 (r)
A
B
C
D


1
7


5
7


12
3


23 10
A
B
C
D


1
7


23 10
Project Operation – Example
• Relation r:
• A,C (r)
A
A
B
C

10
1

20
1

30
1

40
2
C

1

1

1

2
A
=
C

1

1

2
That is, the projection of
a relation on a set of
attributes is a set of tuples
Union Operation – Example
• Relations r, s:
A
B
A
B

1

2

2

3

1
s
r
r  s:
A
B

1

2

1

3
Set Difference Operation – Example
• Relations r, s:
A
B
A
B

1

2

2

3

1
s
r
r – s:
A
B

1

1
Cartesian-Product Operation-Example
Relations r, s:
A
B
C
D
E

1

2




10
10
20
10
a
a
b
b
r
s
r x s:
A
B
C
D
E








1
1
1
1
2
2
2
2








10
10
20
10
10
10
20
10
a
a
b
b
a
a
b
b
Additional Operations
We define additional operations that do not add any power
to the relational algebra, but that simplify common queries.
• Set intersection
• Natural join
• Division
• Assignment
Set-Intersection Operation - Example
• Relation r, s:
A



B
1
2
1
r
• rs
A
B


2
3
s
A
B

2
Natural Join Operation – Example
• Relations r, s:
A
B
C
D
B
D
E





1
2
4
1
2





a
a
b
a
b
1
3
1
2
3
a
a
a
b
b





r
r
s
s
A
B
C
D
E





1
1
1
1
2





a
a
a
a
b





Division Operation – Example
Relations r, s:
r  s:
A


A
B
B











1
2
3
1
1
1
3
4
6
1
2
1
r
2
s
Another Division Example
Relations r, s:
A
B
C
D
E
D
E








a
a
a
a
a
a
a
a








a
a
b
a
b
a
b
b
1
1
1
1
3
1
1
1
a
b
1
1
r
r  s:
A
B
C


a
a


s
Example Queries
• Find the largest account balance
1. Rename account relation as d
2. The query is:
balance(account) - account.balance
(account.balance < d.balance (account x rd (account)))
Example Queries
• Find all customers who have an account at all branches
located in Brooklyn city.
customer-name, branch-name (depositor account)
 branch-name (branch-city = “Brooklyn” (branch))
Extended Relational-Algebra-Operations
• Generalized Projection
• Outer Join
• Aggregate Functions
Generalized Projection
• Extends the projection operation by allowing arithmetic
functions to be used in the projection list.
 F1, F2, …, Fn(E)
• E is any relational-algebra expression
• Each of F1, F2, …, Fn are are arithmetic expressions
involving constants and attributes in the schema of E.
• Given relation credit-info(customer-name, limit, creditbalance), find how much more each person can spend:
customer-name, limit – credit-balance (credit-info)
Aggregate Functions and Operations
• Aggregation function takes a collection of values and
returns a single value as a result.
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
• Aggregate operation in relational algebra
G1, G2, …, Gn g F1( A1), F2( A2),…, Fn( An) (E)
– E is any relational-algebra expression
– G1, G2 …, Gn is a list of attributes on which to group
(can be empty)
– Each Fi is an aggregate function
– Each Ai is an attribute name
Aggregate Operation – Example
• Relation
r:
g sum(c) (r)
A
B
C








7
sum-C
27
7
3
10
Aggregate Operation – Example
• Relation account grouped by branch-name:
branch-name account-number
Perryridge
Perryridge
Brighton
Brighton
Redwood
branch-name
g
balance
A-102
A-201
A-217
A-215
A-222
sum(balance)
400
900
750
750
700
(account)
branch-name
Perryridge
Brighton
Redwood
balance
1300
1500
700
Aggregate Functions
• Result of aggregation does not have a name
– Can use rename operation to give it a name
– For convenience, we permit renaming as part of
aggregate operation
branch-name
g sum(balance) as sum-balance (account)
Outer Join – Example
• Relation loan
•
•
•
•
Employee(ename,str,city)
Works(ename,cname,sal)
Company(cname,city) loan-number branch-name
Downtown
Manages(ename,mname)L-170
L-230
L-260
Redwood
Perryridge
 Relation borrower
customer-name loan-number
Jones
Smith
Hayes
L-170
L-230
L-155
amount
3000
4000
1700
SQL
Lecture 3
SQL
•
•
•
•
•
•
•
•
Data Definition
Basic Query Structure
Set Operations
Aggregate Functions
Null Values
Nested Subqueries
Complex Queries
Views
Data Definition Language
Allows the specification of not only a set of relations but also
information about each relation, including:
• The schema for each relation.
• The domain of values associated with each
attribute.
• Integrity constraints
• The set of indices to be maintained for each
relations.
• Security and authorization information for each
relation.
• The physical storage structure of each relation on
disk.
Basic Query Structure
• A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P
order by
group by
having Q
– Ais represent attributes
– ris represent relations
– P is a predicate.
• This query is equivalent to the relational algebra
expression.
A1, A2, ..., An(P (r1 x r2 x ... x rm))
• The result of an SQL query is a relation.
Set Operations
• Find all customers who have a loan, an account, or both:
(select customer-name from depositor)
union
(select customer-name from borrower)

Find all customers who have both a loan and an account.
(select customer-name from depositor)
intersect
(select customer-name from borrower)
 Find all customers who have an account but no loan.
(select customer-name from depositor)
minus
(select customer-name from borrower)
Aggregate Functions
• These functions operate on the multiset of
values of a column of a relation, and return
a value
avg: average value
min: minimum value
max: maximum value
sum: sum of values
count: number of values
Null Values and Aggregates
• Total all loan amounts
select sum (amount)
from loan
– Above statement ignores null amounts
– result is null if there is no non-null amount, that
is the
• All aggregate operations except count(*)
ignore tuples with null values on the
aggregated attributes.
Nested Subqueries
• SQL provides a mechanism for the nesting
of subqueries.
• A subquery is a select-from-where
expression that is nested within another
query.
• A common use of subqueries is to perform
tests for set membership, set comparisons,
and set cardinality.
Set Comparison
• Find all branches that have greater assets
than some branch located in Brooklyn.
select distinct T.branch-name
from branch as T, branch as S
where T.assets > S.assets and
S.branch-city = ‘Brooklyn’
 Same query using > some clause
select branch-name
from branch
where assets > some
(select assets
from branch
where branch-city = ‘Brooklyn’)
Example Query
• Find the names of all branches that have
greater assets than all branches located in
Brooklyn.
select branch-name
from branch
where assets > all
(select assets
from branch
where branch-city = ‘Brooklyn’)
Test for Empty Relations
• The exists construct returns the value true
if the argument subquery is nonempty.
• exists r  r  Ø
• not exists r  r = Ø
•
Example
Student(name,sport)
Name Sport
Yuri soccer
Yuri baseball
Yuri tennis
Joe
football
Joe
soccer
Jane tennis
Jim tennis
Yuri tennis
Jim football
Find students that play all sports:
Student
..
(Student)
sport
Example
• Student(name,sport)
• Find students that play all sports
• Select distinct id
from students S
where not exists (
(select distinct sport from student)
minus
(select distinct sport from student T
where S.id = T.id)
• Find students that are playing exactly one sport
• Select id
from (select id, count(*)
from student
group by id
having count(*) = 1)
Entity-Relationship Model
Lecture 5
ER Model Components
• Entity Sets
• Attributes
• Relationships
ER Model
title
year
Name
Star-in
Stars
Movies
length
owns
genre
Studios
Name
Address
Address
Relationship
A relationship is a cartesian product of
n  2 entities
{(e1, e2, … en) | e1  E1, e2  E2, …, en  En}
where (e1, e2, …, en) is a relationship
Degree of a Relationship Set
• Refers to number of entity sets that participate in a
relationship set.
• Relationship sets that involve two entity sets are binary (or
degree two). Generally, most relationship sets in a database
system are binary.
• Relationship sets may involve more than two entity sets.
E.g. Suppose employees of a bank may have jobs
(responsibilities) at multiple branches, with different jobs at
different branches. Then there is a ternary relationship set
between entity sets employee, job and branch
• Relationships between more than two entity sets are rare.
Most relationships are binary.
Courses
Students
Enrolls
Instructors
Students
Ann
Sue
Bob
…
…
Courses
CS43005
CS43005
CS43005
…
TAs
Jan
Pat
Jan
Types of Binary Relationships
Many-many
Many-one
Representation of Many-One
Many-one E/R: arrow pointing to “one.”
One-one
Keys for Relationship Sets
• The combination of primary keys of the participating entity
sets forms a super key of a relationship set.
– (customer-id, account-number) is the super key of
depositor
– NOTE: this means a pair of entity sets can have at
most one relationship in a particular relationship set.
• E.g. if we wish to track all access-dates to each
account by each customer, we cannot assume a
relationship for each access. We can use a
multivalued attribute though
• Must consider the mapping cardinality of the relationship
set when deciding the what are the candidate keys
• Need to consider semantics of relationship set in selecting
the primary key in case of more than one candidate key
Converting Multiway to 2-Way
• Create a new connecting E.S. to represent rows of a relationship
set.
– E.g., (Joe's Bar, Bud, $2.50) for the Sells relationship.
• Many-one relationships from the connecting E.S. to the others.
BBP
TheBar
TheBeer
ThePrice
Bars
Beers
Price
Specialization
• within an entity set that are distinctive from other entities in
the set.
• These subgroupings become lower-level entity sets that have
attributes or participate in relationships that do not apply to
the higher-level entity set.
• Depicted by a triangle component labeled ISA (E.g. Topdown design process; we designate subgroupings customer
“is a” person).
• Attribute inheritance – a lower-level entity set inherits all
the attributes and relationship participation of the higherlevel entity set to which it is linked.
Specialization Example
Generalization
• A bottom-up design process – combine a number of entity
sets that share the same features into a higher-level entity
set.
• Specialization and generalization are simple inversions of
each other; they are represented in an E-R diagram in the
same way.
• The terms specialization and generalization are used
interchangeably.
Aggregation
 Consider the ternary relationship works-on, which we saw earlier
 Suppose we want to record managers for tasks performed by an
employee at a branch
E-R Diagram With Aggregation
Weak Entity Sets
• An entity set that does not have a primary key is referred to
as a weak entity set.
• The existence of a weak entity set depends on the existence
of a identifying entity set
– it must relate to the identifying entity set via a total,
one-to-many relationship set from the identifying to the
weak entity set
– Identifying relationship depicted using a double
diamond
• The discriminator (or partial key) of a weak entity set is
the set of attributes that distinguishes among all the entities
of a weak entity set.
• The primary key of a weak entity set is formed by the
primary key of the strong entity set on which the weak
entity set is existence dependent, plus the weak entity set’s
discriminator.
Weak Entity Sets
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with
a dashed line.
• payment-number – discriminator of the payment entity
set
• Primary key for payment – (loan-number, paymentnumber)
Example
student
enrolls
offering
isoffered
requires
course
teaches
instructor
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets
to be expressed uniformly as tables which represent
the contents of the database.
• A database which conforms to an E-R diagram can
be represented by a collection of tables.
• For each entity set and relationship set there is a
unique table which is assigned the name of the
corresponding entity set or relationship set.
• Each table has a number of columns (generally
corresponding to attributes), which have unique
names.
Representing Entity Sets as Tables
• A strong entity set reduces to a table with the
same attributes.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a
table with columns for the primary keys of the two
participating entity sets, and any descriptive
attributes of the relationship set.
• E.g.: table for relationship set borrower
Additional Rules for Translating Relationship into
Relation
If one entity set participates several times in the
relationship with different roles, its key attributes
must be listed as many times and with different
names for each role.
Name
SSN
Student
friends
studies
subject
favorite
Studies(SSN, Name); Favorite(SSN, Name);
Friends(SSN1, SSN2)
Redundancy of Tables

Many-to-one relationship sets that are total on the manyside can be represented by adding an extra attribute to the
many side, containing the primary key of the one side
 Example: We eliminate relation Favorite and we extend
relation for Student as follows:
Student(SSN, Name, Subject.name)
 If, however, the relationship is many-to-many we cannot do
that since it leads to redundancy
For example relation Studies cannot be eliminated since
otherwise we may end up with:
111-222-333 John OS
111-222-333 John DBMS
Representing Weak Entity Sets
 A weak entity set becomes a table that includes a column for
the primary key of the identifying strong entity set
Representing Specialization as Tables
Form a table for the higher level entity
Form a table for each lower level entity set, include
primary key of higher level entity set and local
attributes
table
person
customer
employee
table attributes
name, street, city
name, credit-rating
name, salary
– Drawback: getting information about, e.g., employee
requires accessing two tables
Relations Corresponding to Aggregation
 To represent aggregation, create a table containing
 primary key of the aggregated relationship,
 the primary key of the associated entity set
 Any descriptive attributes
Example
ssn
name
passenger
person
ISA
age
booked
ISA
date
departure
assigned
gate
pilot
instantof
canfly
flight
F#
dtime
#fhrs
plane
atime
man
model
Example
Name
ssn
name
Team
Player
Plays
age
Uses
ISA
Has
Colors
Captain
Likes-Colors
likes
Fan
name
addr
Relational Database Design Theory
Lecture 6
First Normal Form
• Domain is atomic if its elements are considered to be
indivisible units
– Examples of non-atomic domains:
• Set of names, composite attributes
• Identification numbers like CS101 that can be
broken up into parts
• A relational schema R is in first normal form if the
domains of all attributes of R are atomic
• Non-atomic values complicate storage and encourage
redundant (repeated) storage of data
– Example: Set of accounts stored with each customer,
and set of owners stored with each account
– We assume all relations are in first normal form
First Normal Form
• Atomicity is actually a property of how the elements of the
domain are used.
– Example: Strings would normally be considered
indivisible
– Suppose that students are given roll numbers which are
strings of the form CS0012 or EE1127
– If the first two characters are extracted to find the
department, the domain of roll numbers is not atomic.
– Doing so is a bad idea: leads to encoding of information
in application program rather than in the database.
Functional Dependencies
• Constraints on the set of legal relations.
• Require that the value for a certain set of attributes
determines uniquely the value for another set of attributes.
• A functional dependency is a generalization of the notion
of a key.
Functional Dependencies
 Let R(A1, A2, ….Ak) be a relational schema; X and Y
are subsets of {A1, A2, …Ak}. We say that X->Y,
if any two tuples that agree on X, then they agree on
Y.
 Example:
Student(SSN,Name,Addr,subjectTaken,favSubject,Prof)
SSN->Name
SSN->Addr
subjectTaken->Prof
Assign(Pilot,Flight,Date,Departs)
Pilot,Date,Departs->Flight
Functional Dependencies
• A functional dependency X->Y is trivial if it is satisfied by
any relation that includes attributes from X and Y
– E.g.
• customer-name, loan-number  customer-name
• customer-name  customer-name
– In general,    is trivial if   
Closure of a Set of Functional Dependencies
• Given a set F set of functional dependencies, there are certain
other functional dependencies that are logically implied by F.
– E.g. If A  B and B  C, then we can infer that A 
C
• The set of all functional dependencies logically implied by F
is the closure of F.
• We denote the closure of F by F+.
Closure of a Set of Functional Dependencies
• An inference axiom is a rule that states if a relation satisfies
certain FDs, it must also satisfy certain other FDs
• Set of inference rules is sound if the rules lead only to true
conclusions
• Set of inference rules is complete, if it can be used to
conclude every valid FD on R
• We can find all of F+ by applying Armstrong’s Axioms:
– if   , then   
(reflexivity)
– if   , then     
(augmentation)
– if   , and   , then    (transitivity)
• These rules are
– sound and complete
Example
• R = (A, B, C, G, H, I)
F={ AB
AC
CG  H
CG  I
B  H}
• some members of F+
– AH
• by transitivity from A  B and B  H
– AG  I
• by augmenting A  C with G, to get AG  CG
and then transitivity with CG  I
Procedure for Computing F+
• To compute the closure of a set of functional dependencies
F:
F+ = F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F+
for each pair of functional dependencies f1and f2 in F+
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency
to F+
until F+ does not change any further
Closure of Attribute Sets
• Given a set of attributes , define the closure of  under F
(denoted by +) as the set of attributes that are functionally
determined by  under F:
   is in F+    +
• Algorithm to compute +, the closure of  under F
result := ;
while (changes to result) do
for each    in F do
begin
if   result then result := result  
end
Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
• Testing for superkey:
– To test if  is a superkey, we compute +, and check if
+ contains all attributes of R.
• Testing functional dependencies
– To check if a functional dependency    holds (or, in
other words, is in F+), just check if   +.
– That is, we compute + by using attribute closure, and
then check if it contains .
– Is a simple and cheap test, and very useful
• Computing closure of F
– For each   R, we find the closure +, and for each S 
+, we output a functional dependency   S.
Example of Attribute Set Closure
• R = (A, B, C, G, H, I)
• F = {A  B, A  C, CG  H, CG  I, B  H}
• (AG)+
1. result = AG
2. result = ABCG (A  C and A  B)
3. result = ABCGH (CG  H and CG  AGBC)
4. result = ABCGHI (CG  I and CG  AGBCH)
• Is AG a key?
1. Is AG a super key?
1. Does AG  R? == Is (AG)+  R
2. Is any subset of AG a superkey?
1. Does A  R? == Is (A)+  R
2. Does G  R? == Is (G)+  R
Extraneous Attributes
• Consider a set F of functional dependencies and the
functional dependency    in F.
– Attribute A is extraneous in  if A   and
F logically implies {( – A)  } or
Attribute A   is extraneous in  if A   and the set
of functional dependencies
(F – {  })  { ( – A)} logically implies F.
• Example: Given F = {A  C, AB  C }
– B is extraneous in AB  C because {A  C}
logically implies AB C, A C.
• Example: Given F = {A  C, AB  CD}
– C is extraneous in AB  CD since {AB  D,A C}
implies AB  C
Testing if an Attribute is Extraneous
• Consider a set F of functional dependencies and the
functional dependency    in F.
• To test if attribute A   is extraneous in 
1. compute ({} – A)+ using the dependencies in
F
2. check that ({} – A)+ contains A; if it does, A is
extraneous
• To test if attribute A   is extraneous in 
1. compute + using only the dependencies in
F’ = (F – {  })  { ( – A)},
2. check that + contains A; if it does, A is extraneous
Canonical Cover
• Sets of functional dependencies may have redundant
dependencies that can be inferred from the others
– Eg: A  C is redundant in: {A  B, B  C, A 
C}
– Parts of a functional dependency may be redundant
• E.g. : {A  B, B  C, A  CD} can be
simplified to
{A  B, B  C, A  D}
• E.g. : {A  B, B  C, AC  D} can be
simplified to
{A  B, B  C, A  D}
• A canonical cover of F is a “minimal” set of functional
dependencies equivalent to F, having no redundant
dependencies or redundant parts of dependencies
Canonical Cover
(Formal Definition)
• A canonical cover for F is a set of dependencies Fc such that
– F logically implies all dependencies in Fc, and
– Fc logically implies all dependencies in F, and
– No functional dependency in Fc contains an extraneous
attribute, and
– Each left side of functional dependency in Fc is unique.
Canonical Cover
Computation
• To compute a canonical cover for F:
repeat
Use the union rule to replace any dependencies in F
1  1 and 1  1 with 1  1 2
Find a functional dependency    with an
extraneous attribute either in  or in 
If an extraneous attribute is found, delete it from   
until F does not change
Example of Computing a Canonical Cover
• R = (A, B, C)
F = {A  BC
BC
AB
AB  C}
• Combine A  BC and A  B into A  BC
• A is extraneous in AB  C
– Set is now {A  BC, B  C}
• C is extraneous in A  BC
– Check if A  C is logically implied by A  B and the
other dependencies
• Yes: using transitivity on A  B and B  C.
• The canonical cover is:
ABBC
Decomposition
• All attributes of an original schema (R) must appear
in the decomposition (R1, R2):
R = R1  R2
• Lossless-join decomposition.
For all possible relations r on schema R
r = R1 (r) R2 (r)
• A decomposition of R into R1 and R2 is lossless join if
and only if at least one of the following dependencies
is in F+:
– R1  R2  R1
– R1  R2  R2
Normalization Using Functional Dependencies
• When we decompose a relation schema R with a set of
functional dependencies F into R1, R2,.., Rn we want
– Lossless-join decomposition: Otherwise decomposition
would result in information loss.
– Dependency preservation: Let Fi be the set of dependencies
F+ that include only attributes in Ri.
(F1  F2  …  Fn)+ = F+
.
Example
• R = (A, B, C)
F = {A  B, B  C)
– Can be decomposed in two different ways
• R1 = (A, B), R2 = (B, C)
– Lossless-join decomposition:
R1  R2 = {B} and B  BC
– Dependency preserving
• R1 = (A, B), R2 = (A, C)
– Lossless-join decomposition:
R1  R2 = {A} and A  AB
– Not dependency preserving
(cannot check B  C without computing R1
R2)
Testing for Dependency Preservation
• To check if a dependency  is preserved in a
decomposition of R into R1, R2, …, Rn we apply the
following simplified test (with attribute closure done w.r.t. F)
– result = 
while (changes to result) do
for each Ri in the decomposition
t = (result  Ri)+  Ri
result = result  t
– If result contains all attributes in , then the functional
dependency
   is preserved.
• We apply the test on all dependencies in F to check if a
decomposition is dependency preserving
• This procedure takes polynomial time, instead of the
exponential time required to compute F+ and (F1  F2  … 
Fn)+
Boyce-Codd Normal Form
A relation schema R is in BCNF with respect to a set F of functional
dependencies if for all functional dependencies in F+ of the form
  , where   R and   R, at least one of the following holds:
•    is trivial (i.e.,   )
•  is a superkey for R
Example
• R = (A, B, C)
F = {A  B
B  C}
Key = {A}
• R is not in BCNF
• Decomposition R1 = (A, B), R2 = (B, C)
– R1 and R2 in BCNF
– Lossless-join decomposition
– Dependency preserving
Testing for BCNF
• To check if a non-trivial dependency   causes a
violation of BCNF
1. compute + (the attribute closure of ), and
2. verify that it includes all attributes of R
• Using only F is incorrect when testing a relation in a
decomposition of R
– E.g. Consider R (A, B, C, D), with F = { A B, B C}
• Decompose R into R1(A,B) and R2(A,C,D)
• Neither of the dependencies in F contain only
attributes from
(A,C,D) so we might be mislead into thinking R2
satisfies BCNF.
• In fact, dependency A  C in F+ shows R2 is not in
BCNF.
BCNF Decomposition Algorithm
result := {R};
done := false;
compute F+;
while (not done) do
if (there is a schema Ri in result that is not in BCNF)
then begin
let    be a nontrivial functional
dependency that holds on Ri such that   Ri is not in
F+, and    = ;
result := (result – Ri )  (Ri – )  (,  );
end
else done := true;
Each Ri is in BCNF, and decomposition is lossless-join.
BCNF and Dependency Preservation
It is not always possible to get a BCNF decomposition that is
dependency preserving
• R = (A, B, C)
F = {AB  C
C  B}
Two candidate keys = AB and AC
• R is not in BCNF
• Any decomposition of R will fail to
preserve
AB  C
Third Normal Form
• A relation schema R is in third normal form (3NF) if for
all:    in F+ at least one of the following holds:
–    is trivial (i.e.,   )
–  is a superkey for R
– Each attribute A in  –  is contained in a candidate key
for R.
• If a relation is in BCNF it is in 3NF (since in BCNF one of
the first two conditions above must hold).
• Third condition is a minimal relaxation of BCNF to ensure
dependency preservation.
Third Normal Form
• Example
– R = (A,B,C)
F = {AB  C, C  B}
– Two candidate keys: AB and AC
– R is in 3NF
AB  C AB is a superkey
CB
B is contained in a candidate key
 BCNF decomposition has (AC) and (BC)
 Testing for AB  C requires a join
Testing for 3NF
• Use attribute closure to check for each dependency   ,
if  is a superkey.
• If  is not a superkey, we have to verify if each attribute in
 is contained in a candidate key of R
– this test is rather more expensive, since it involve
finding candidate keys
– testing for 3NF has been shown to be NP-hard
– However, decomposition into third normal form can be
done in polynomial time
3NF Decomposition Algorithm
Let Fc be a canonical cover for F;
i := 0;
for each functional dependency    in Fc do
if none of the schemas Rj, 1  j  i contains  
then begin
i := i + 1;
Ri :=  
end
if none of the schemas Rj, 1  j  i contains a candidate key
for R
then begin
i := i + 1;
Ri := any candidate key for R;
end
return (R1, R2, ..., Ri)
Storage Hierarchy
Storage Hierarchy
• primary storage: Fastest media but volatile (cache, main
memory).
• secondary storage: next level in hierarchy, non-volatile,
moderately fast access time
– also called on-line storage
– E.g. flash memory, magnetic disks
• tertiary storage: lowest level in hierarchy, non-volatile,
slow access time
– also called off-line storage
– E.g. magnetic tape, optical storage
Magnetic Disks
• Disk controller – interfaces between the computer system
and the disk drive hardware.
– accepts high-level commands to read or write a sector
– initiates actions such as moving the disk arm to the right
track and actually reading or writing the data
– Computes and attaches checksums to each sector to
verify that data is read back correctly
• If data is corrupted, with very high probability stored
checksum won’t match recomputed checksum
– Ensures successful writing by reading back sector after
writing it
– Performs remapping of bad sectors
Disk Subsystem
• Multiple disks connected to a computer system through a
controller
– Controllers functionality (checksum, bad sector
remapping) often carried out by individual disks; reduces
load on controller
• Disk interface standards families
– ATA (AT adaptor) range of standards
– SCSI (Small Computer System Interconnect) range of
standards
– Several variants of each standard (different speeds and
Performance Measures of Disks
• Access time – the time it takes from when a read or write
request is issued to when data transfer begins. Consists of:
– Seek time – time it takes to reposition the arm over the
correct track.
• Average seek time is 1/2 the worst case seek time.
• 4 to 10 milliseconds on typical disks
– Rotational latency – time it takes for the sector to be
accessed to appear under the head.
• Average latency is 1/2 of the worst case latency.
• 4 to 11 milliseconds on typical disks
– Data-transfer rate – the rate at which data can be
retrieved from or stored to the disk.
• 4 to 8 MB per second is typical
– Multiple disks may share a controller, so transfer rate that
controller can handle is also important
• E.g. ATA-5: 66 MB/second, SCSI-3: 40 MB/s
Storage Access
• A database file is partitioned into fixed-length storage units
called blocks. Blocks are units of both storage allocation
and data transfer.
• Database system seeks to minimize the number of block
transfers between the disk and memory. We can reduce the
number of disk accesses by keeping as many blocks as
possible in main memory.
• Buffer – portion of main memory available to store copies
of disk blocks.
• Buffer manager – subsystem responsible for allocating
buffer space in main memory.
Buffer Manager
• Programs call on the buffer manager when they need a
block from disk.
1. If the block is already in the buffer, the requesting
program is given the address of the block in main
memory
2. If the block is not in the buffer,
1. the buffer manager allocates space in the buffer for
the block, replacing (throwing out) some other
block, if required, to make space for the new block.
2. The block that is thrown out is written back to disk
only if it was modified since the most recent time
that it was written to/fetched from the disk.
3. Once space is allocated in the buffer, the buffer
manager reads the block from the disk to the buffer,
and passes the address of the block in main memory
Buffer-Replacement Policies
• Most operating systems replace the block least recently used
(LRU strategy)
• Idea behind LRU – use past pattern of block references as a
predictor of future references
• Queries have well-defined access patterns (such as sequential
scans), and a database system can use the information in a
user’s query to predict future references
– LRU can be a bad strategy for certain access patterns
involving repeated scans of data
• e.g. when computing the join of 2 relations r and s by a
nested loops
for each tuple tr of r do
for each tuple ts of s do
if the tuples tr and ts match …
– Mixed strategy with hints on replacement strategy provided
by the query optimizer is preferable
•
•
•
•
•
Buffer-Replacement Policies
Pinned block – memory block that is not allowed to be
written back to disk.
Toss-immediate strategy – frees the space occupied by
a block as soon as the final tuple of that block has been
processed
Most recently used (MRU) strategy – system must pin
the block currently being processed. After the final
tuple of that block has been processed, the block is
unpinned, and it becomes the most recently used block.
Buffer manager can use statistical information regarding
the probability that a request will reference a particular
relation
– E.g., the data dictionary is frequently accessed.
Heuristic: keep data-dictionary blocks in main
memory buffer
Buffer managers also support forced output of blocks for
the purpose of recovery
File Organization
• The database is stored as a collection of files. Each file is a
sequence of records. A record is a sequence of fields.
• One approach:
– assume record size is fixed
– each file has records of one particular type only
– different files are used for different relations
This case is easiest to implement; will consider variable
length records later.
Fixed-Length Records
• Simple approach:
– Store record i starting from byte n  (i – 1), where n is
the size of each record.
– Record access is simple but records may cross blocks
• Modification: do not allow records to cross block
boundaries
• Deletion of record I:
alternatives:
– move records i + 1, . . ., n
to i, . . . , n – 1
– move record n to i
– do not move records, but
link all free records on a
free list
Variable-Length Records: Slotted Page Structure
• Slotted page header contains:
– number of record entries
– end of free space in the block
– location and size of each record
• Records can be moved around within a page to keep them
contiguous with no empty space between them; entry in
the header must be updated.
• Pointers should not point directly to record — instead
they should point to the entry for the record in header.
Organization of Records in Files
• Heap – a record can be placed anywhere in the file where
there is space
• Sequential – store records in sequential order, based on
the value of the search key of each record
• Hashing – a hash function computed on some attribute of
each record; the result specifies in which block of the file
the record should be placed
• Records of each relation may be stored in a separate file. In
a clustering file organization records of several different
relations can be stored in the same file
– Motivation: store related records on the same block to
minimize I/O
Sequential File Organization
• Suitable for applications that require sequential
processing of the entire file
• The records in the file are ordered by a search-key
Clustering File Organization
• Simple file structure stores each relation in a separate file
• Can instead store several relations in one file using a clustering
file organization
• E.g., clustering organization of customer and depositor:
• l scan using a secondary index is expensive
– each record access may fetch a new block from disk
• an entry was deleted from their parent)
• Root node then had only one child, and was deleted and its child
became the new root node