Transcript CS186: Introduction to Database Systems
EECS 647: Introduction to Database Systems
Instructor: Luke Huan Spring 2007
Administrative
Homework #1 is
due next Monday (Feb 19 th ) Need to use your EECS password to login either cycle2 or wozniak (the same machine) Type psql and use your psql password to login Let me know if there is anything wrong with you login Be careful in typing O, 0, 1, l Engineering & Computer Science Spring Career Fair is tomorrow (Feb 15 th ) from 12:30-5:00. Game night with employer is tonight 7:30-10:30pm in the Burge Union, Relays room on 3 rd floor. 4/30/2020 Luke Huan Univ. of Kansas 2
Review
The
closure of
Z
(denoted
Z
+ ) with respect to F is the set of all attributes {
A
1 ,
A
2 , …} functionally determined by
Z
(that is, Z !
A
1
A
2 …) 4/30/2020 Luke Huan Univ. of Kansas 3
Review
Property(County_name, Lot#, Area, Price, Tax_rate)
County_name
!
Tax_rate Area
!
Price
County_name, Lot#
!
Area
Compute the closur of {
County_name, Lot#
} + {
County_name, Lot#
} + {
County_name, Lot#
} + {
County_name, Lot#
} + {
County_name, Lot#
} + {
County_name, Lot#
} {
County_name, Lot#, Tax_rate
} {
County_name, Lot#, Tax_rate, Area
} {
County_name, Lot#, Area, Price, Tax_rate
} Is
County_name, Lot#
a super key? Yes 4/30/2020 Luke Huan Univ. of Kansas 4
Today’s Topic
Non-key dependency BCNF SQL Select-from-where Set vs. bag 4/30/2020 Luke Huan Univ. of Kansas 5
Bad decomposition
EID 1234 1123 1234 1023 PID 10 9 9 10 Hours 10 40 30 40 EID PID EID Hours 1234 1123 10 9 1234 1123 10 40 1234 9 1234 30 1023 10 1023 Association between
PID
and
hours
is lost 40 Join returns more rows than the original relation 4/30/2020 Luke Huan Univ. of Kansas 6
Lossless join decomposition
Decompose relation
R
into relations
S
and
T
attrs
(
R
) =
attrs
(
S
) [
S T
= ¼
attrs
(
S
) = ¼
attrs
(
T
) (
R
(
R
) )
attrs
(
T
) The decomposition is a
lossless join decomposition
given known
constraints
that
R
=
S
T
if, such as FD’s, we can guarantee Any decomposition gives A
lossy R
µ
S
T
decomposition is one with
R
(why?)
S
T
4/30/2020 Luke Huan Univ. of Kansas 7
Loss? But I got more rows!
“Loss” refers not to the loss of tuples, but to the loss of information Or, the ability to distinguish different original tuples EID PID Hours 1234 1123 1234 1023 10 9 9 10 10 40 30 40 4/30/2020 EID 1234 1123 1234 1023 PID 10 9 9 10 Luke Huan Univ. of Kansas EID 1234 1123 1234 1023 Hours 10 40 30 40 8
Questions about decomposition
When to decompose How to come up with a correct decomposition (i.e., lossless join decomposition) 4/30/2020 Luke Huan Univ. of Kansas 9
Non key FD’s
Consider a non-trivial FD
X
!
Y
where
X
is
not
a super key Since
X
is not a super key, there are some attributes (say
Z
) that are not functionally determined by
X
X a a Y b b Z c d That
b
is always associated with
a
is recorded by multiple rows:
redundancy
,
update anomaly
,
deletion anomaly
4/30/2020 Luke Huan Univ. of Kansas 10
4/30/2020 X a a Z c d
Deal with Non-key FD
X a a Y b b Z c d
X
!
Y
X a Y b X a a Y b b Z c d Luke Huan Univ. of Kansas 11
Dealing with Nonkey Dependency: BCNF
A relation
R
is in
B
oyce-
C
odd
N
ormal
F
orm if For every non-trivial FD
X
!
Y
in
R
,
X
is a super key That is, all FDs follow from “key !
other attributes” Or, no non-key dependency When to decompose As long as some relation is not in BCNF How to come up with a correct decomposition Always decompose on a BCNF violation (details next) Then it is guaranteed to be a lossless join decomposition!
4/30/2020 Luke Huan Univ. of Kansas 12
BCNF decomposition algorithm
Find a
BCNF violation
That is, a non-trivial FD
X
!
Y
in
R
where
X
is not a super key of
R
Decompose
R
into
R
1 and
R
2 , where
R
1 has attributes
X
[
Y
R
2 has attributes
X
that are in neither
X
[
Z
, where
Z
contains all attributes of
R
nor
Y
(i.e.
Z = attr(R) – X – Y
) Repeat until all relations are in BCNF 4/30/2020 Luke Huan Univ. of Kansas 13
BCNF decomposition example
EID
!
Ename
,
email EID, PID
!
Hours email
!
EID
WorkOn
(
EID
,
Ename
,
,
PID
,
hours
)
BCNF violation:
EID
!
Ename
,
email Student
(
EID
,
Ename
,
)
BCNF
Grade
(
EID
,
PID
,
hours
)
BCNF
4/30/2020 Luke Huan Univ. of Kansas 14
Another example
EID
!
Ename
,
email EID, PID
!
Hours email
!
EID
WorkOn
(
EID
,
Ename
,
,
PID
,
hours
)
BCNF violation:
!
EID StudentID
(
,
EID
)
BCNF
StudentGrade’
(
,
Ename
,
PID
,
hours
)
BCNF violation:
!
Ename
4/30/2020
StudentName
(
,
Ename
)
BCNF
Grade
(
,
PID
,
hours
) Luke Huan Univ. of Kansas 15
BCNF
Exercise
Property(Property_id#, County_name, Lot#, Area, Price, Tax_rate
) Property_id# !
County_name, Lot#, Area, Price, Tax_rate
County_name, Lot#
! Property_id
#, Area, Price, Tax_rate County_name
!
Tax_rate area
!
Price
16 4/30/2020 Luke Huan Univ. of Kansas
Exercise
Property(Property_id#, County_name, Lot#, Area, Price, Tax_rate) BCNF violation: County_name
!
Tax_rate LOTS1
BCNF
(
County_name, Tax_rate LOTS2
( )
Property_id#, County_name, Lot#, Area, Price
)
BCNF violation: Area
!
Price LOTS2A
BCNF
(
Area, Price
)
LOTS2B
(
Property_id#, County_name, Lot#, Area
)
BCNF
4/30/2020 Luke Huan Univ. of Kansas 17
Why is BCNF decomposition lossless
Given non-trivial
X
!
Y
in
R
where
X
is
not
a super key of
R
, need to prove: Anything we project always comes back in the join:
R
µ ¼
XY
(
R
) ¼
XZ
(
R
) Sure; and it doesn’t depend on the FD Anything that comes back in the join must be in the original relation:
R
¶ ¼
XY
(
R
) ¼
XZ
(
R
) Proof makes use of the fact that
X
!
Y
4/30/2020 Luke Huan Univ. of Kansas 18
Summery
Non-key functional dependencies: a source of redundancy BCNF decomposition: a method removes ALL non-key dependency and therefore completely removes any functional-dependency-related redundancies Plus, BNCF decomposition is a lossless join decomposition 4/30/2020 Luke Huan Univ. of Kansas 19
Database Design
4/30/2020 Luke Huan Univ. of Kansas 20
SQL
SQL:
Structured Query Language
Pronounced “S-Q-L” or “sequel” The standard query language supported by most commercial DBMS A brief history IBM System R ANSI SQL89 ANSI SQL92 (SQL2) ANSI SQL99 (SQL3) ANSI SQL 2003 (+OLAP, XML, etc.) 4/30/2020 Luke Huan Univ. of Kansas 21
Creating and dropping tables
CREATE TABLE
table_name
(
…
,
column_name i column_type i
, DROP TABLE
table_name
;
…
); Examples create table Student (SID integer, name varchar(30), email varchar(30), age integer, GPA float); create table Course (CID char(10), title varchar(100)); create table Enroll (SID integer, CID char(10)); drop table Student; drop table Course; drop table Enroll; -- everything from -- to the end of the line is ignored.
-- SQL is insensitive to white space.
-- SQL is insensitive to case (e.g., ...Course... is equivalent to -- ...COURSE...) 4/30/2020 Luke Huan Univ. of Kansas 22
Basic queries: SFW statement
SELECT
A
1 ,
A
2 , … ,
A n
FROM
R
1 ,
R
2 , … ,
R m
WHERE
condition
; Also called an SPJ (select-project-join) query (
almost
) Equivalent to relational algebra query ¼
A
1 ,
A
2 , …,
A n
( ¾
condition
(
R
1 £
R
2 £ … £
R m
)) 4/30/2020 Luke Huan Univ. of Kansas 23
Semantics of SFW
SELECT
E
1 ,
E
2 , … ,
E n
FROM
R
1 ,
R
2 , … ,
R m
WHERE
condition
; For each
t
1 For each in
R
1 :
t
2 in
R
2 : … … For each If
t m
in
R m
:
condition
is true over
t
1 ,
t
2 , …,
t m
: Compute and output
E
1 ,
E
2 , …,
E n t
1 ,
t
2 , …,
t m
are often called tuple variables as a row Not 100% correct, we will see 4/30/2020 Luke Huan Univ. of Kansas 24
Example: selection and projection
Name of students under 18 SELECT name FROM Student WHERE age < 20; ¼
name
( ¾
age <20
(
Student
)) sid 1234 1123 1011 1204 1306 name John Smith Mary Carter Bob Lee Susan Wong Kevin Kim age 21 19 22 22 18 gpa 3.5
3.8
2.6
3.4
2.9
sid 1234 1123 1011 1204 1306 name John Smith Mary Carter Bob Lee Susan Wong Kevin Kim age 21 19 22 22 18 gpa 3.5
3.8
2.6
3.4
2.9
4/30/2020 Luke Huan Univ. of Kansas 25
Example: Operations
When was Lisa born?
SELECT 2006 – age FROM Student WHERE name = ’Lisa’; SELECT list can contain expressions Can also use built-in functions such as SUBSTR , ABS , etc.
String literals (case sensitive) are enclosed in single quotes 4/30/2020 Luke Huan Univ. of Kansas 26
Example: reading a table
SELECT * FROM Student; Single-table query, so no cross product here WHERE clause is optional * is a short hand for “all columns” 4/30/2020 Luke Huan Univ. of Kansas 27
Example: join
SID’s and names of students taking the “Database” courses SELECT Student.SID, Student.name
FROM Student, Enroll, Course WHERE Student.SID = Enroll.SID
AND Enroll.CID = Course.CID
AND title = ’Database’; Okay to omit
table_name
in
table_name
.
column_name
if
column_name
is unique A better way to deal with string is to use “LIKE” and % , which matches with any string with length 0 or more AND title LIKE ’%Database%’; \ is the escape operator ‘_’ is used for representing any a single character 4/30/2020 Luke Huan Univ. of Kansas 28
ORDER BY
SELECT ...
FROM … WHERE ORDER BY
output_column
|
DESC
]
,
…
; ASC = ascending, DESC = descending
[
ASC Operational semantics After SELECT list has been computed, sort the output according to ORDER BY specification 4/30/2020 Luke Huan Univ. of Kansas 29
ORDER BY example
List all students, sort them by GPA (descending) and name (ascending) SELECT SID, name, age, GPA FROM Student ORDER BY GPA DESC, name; ASC is the default option Strictly speaking, only output columns can appear in ORDER BY clause (although some DBMS support more) Can use sequence numbers instead of names to refer to output columns: ORDER BY 4 DESC, 2; 4/30/2020 Luke Huan Univ. of Kansas 30
Example: rename
SID’s of all pairs of classmates Relational algebra query: ¼
e
1.
SID
,
e
2.
SID
( ½
e
1
Enroll
!
e
1.
CID
=
e
2.
CID
Æ
e
1.
SID
>
e
2.
SID
½
e
2
Enroll
) SQL: SELECT e1.SID
AS
FROM Enroll
AS
SID1, e2.SID e1, Enroll
AS AS
e2 WHERE e1.CID = e2.CID
AND e1.SID > e2.SID; SID2
AS
keyword is optional 4/30/2020 Luke Huan Univ. of Kansas 31
A more complicated example
Titles of all courses that Bart and Lisa are taking together SELECT c.title
FROM Student sb, Student sl, Enroll eb, Enroll el, Course c WHERE sb.name = ’Bart’ AND sl.name = ’Lisa’ AND eb.SID = sb.SID AND el.SID = sl.SID
AND eb.CID = c.CID AND el.CID = c.CID; Tip: Write the FROM clause first, then WHERE , and then SELECT 4/30/2020 Luke Huan Univ. of Kansas 32
Why SFW statements?
Out of many possible ways of structuring SQL statements, why did the designers choose SELECT FROM WHERE ?
A large number of queries can be written using only selection, projection, and cross product (or join) Any query that uses only these operators can be written in a canonical form: ¼
L
( ¾
p
(
R
1 £ … £
R m
)) Example: ¼
R
.
A
,
S
.
B
¼
R
.
A
,
S
.
B
,
T
.
C
¾
p
1 Æ (
R p
2 Æ !
p
1
p
3 (
S R
) £ !
S p
£ 2 ( ¼
T
.
C T
) ¾
p
3
T
) = SELECT FROM WHERE captures this canonical form 4/30/2020 Luke Huan Univ. of Kansas 33
Set versus bag semantics
Set No duplicates Relational model and algebra use set semantics Bag Duplicates allowed Number of duplicates is significant SQL uses bag semantics by default 4/30/2020 Luke Huan Univ. of Kansas 34
Set versus bag example
sid 1234 1234 1124 1123
Enroll
cid 647 108 647 108 grade A B A A ¼
SID Enroll
sid 1234 1124 1123 SELECT SID FROM Enroll; Luke Huan Univ. of Kansas sid 1234 1234 1124 1123 4/30/2020 35
A case for bag semantics
Efficiency Saves time of eliminating duplicates Which one is more useful?
¼
GPA Student
SELECT GPA FROM Student; The first query just returns all possible GPA’s The second query returns the actual GPA distribution Besides, SQL provides the option of set semantics with DISTINCT keyword 4/30/2020 Luke Huan Univ. of Kansas 36
Forcing set semantics
SID’s of all pairs of classmates SELECT e1.SID AS SID1, e2.SID AS SID2 FROM Enroll AS e1, Enroll AS e2 WHERE e1.CID = e2.CID
AND e1.SID > e2.SID; Say Bart and Lisa both take CPS116 and CPS114 SELECT DISTINCT SID2 ...
e1.SID AS SID1, e2.SID AS With DISTINCT , all duplicate ( SID1 , SID2 ) pairs are removed from the output 4/30/2020 Luke Huan Univ. of Kansas 37
Operational semantics of SFW
SELECT [ DISTINCT ] FROM
R
1 ,
R
2 , … ,
R m E
1 ,
E
2 , … ,
E n
WHERE
condition
; For each
t
1 For each in
R
1 :
t
2 in
R
2 : … … For each If
t m
in
R m
:
condition
is true over
t
1 ,
t
2 , …,
t m
: Compute and output
E
1 ,
E
2 , …,
E n
If DISTINCT is present as a row Eliminate duplicate rows in output
t
1 ,
t
2 , …,
t m
are often called tuple variables 4/30/2020 Luke Huan Univ. of Kansas 38
Review
Non-key dependency is a source of redundancy Using BCNF to remove dependency related redundancy SELECT FROM WHERE statements (select-project-join queries) Set and bag operations Next: how to nest SQL queries 4/30/2020 Luke Huan Univ. of Kansas 39