CS186: Introduction to Database Systems

Download Report

Transcript CS186: Introduction to Database Systems

EECS 647: Introduction to Database Systems

Instructor: Luke Huan Spring 2007

Administrative

Homework #1 is

due next Monday (Feb 19 th )   Need to use your EECS password to login either cycle2 or wozniak (the same machine) Type psql and use your psql password to login  Let me know if there is anything wrong with you login  Be careful in typing O, 0, 1, l  Engineering & Computer Science Spring Career Fair is tomorrow (Feb 15 th ) from 12:30-5:00.  Game night with employer is tonight 7:30-10:30pm in the Burge Union, Relays room on 3 rd floor. 4/30/2020 Luke Huan Univ. of Kansas 2

Review

 The

closure of

Z

(denoted

Z

+ ) with respect to F is  the set of all attributes {

A

1 ,

A

2 , …} functionally determined by

Z

(that is, Z !

A

1

A

2 …) 4/30/2020 Luke Huan Univ. of Kansas 3

Review

  

Property(County_name, Lot#, Area, Price, Tax_rate)

 

County_name

!

Tax_rate Area

!

Price

County_name, Lot#

!

Area

Compute the closur of {

County_name, Lot#

} +     {

County_name, Lot#

} +  {

County_name, Lot#

} +  {

County_name, Lot#

} +  {

County_name, Lot#

} +  {

County_name, Lot#

} {

County_name, Lot#, Tax_rate

} {

County_name, Lot#, Tax_rate, Area

} {

County_name, Lot#, Area, Price, Tax_rate

} Is

County_name, Lot#

a super key?  Yes 4/30/2020 Luke Huan Univ. of Kansas 4

Today’s Topic

 Non-key dependency  BCNF  SQL   Select-from-where Set vs. bag 4/30/2020 Luke Huan Univ. of Kansas 5

Bad decomposition

EID 1234 1123 1234 1023 PID 10 9 9 10 Hours 10 40 30 40 EID PID EID Hours 1234 1123 10 9 1234 1123 10 40 1234 9 1234 30  1023 10 1023 Association between

PID

and

hours

is lost 40  Join returns more rows than the original relation 4/30/2020 Luke Huan Univ. of Kansas 6

Lossless join decomposition

  Decompose relation

R

into relations

S

and

T

  

attrs

(

R

) =

attrs

(

S

) [

S T

= ¼

attrs

(

S

) = ¼

attrs

(

T

) (

R

(

R

) )

attrs

(

T

) The decomposition is a

lossless join decomposition

given known

constraints

that

R

=

S

T

if, such as FD’s, we can guarantee  Any decomposition gives  A

lossy R

µ

S

T

decomposition is one with

R

(why?) 

S

T

4/30/2020 Luke Huan Univ. of Kansas 7

Loss? But I got more rows!

 “Loss” refers not to the loss of tuples, but to the loss of information  Or, the ability to distinguish different original tuples EID PID Hours 1234 1123 1234 1023 10 9 9 10 10 40 30 40 4/30/2020 EID 1234 1123 1234 1023 PID 10 9 9 10 Luke Huan Univ. of Kansas EID 1234 1123 1234 1023 Hours 10 40 30 40 8

Questions about decomposition

 When to decompose  How to come up with a correct decomposition (i.e., lossless join decomposition) 4/30/2020 Luke Huan Univ. of Kansas 9

Non key FD’s

 Consider a non-trivial FD

X

!

Y

where

X

is

not

a super key  Since

X

is not a super key, there are some attributes (say

Z

) that are not functionally determined by

X

X a a Y b b Z c d That

b

is always associated with

a

is recorded by multiple rows:

redundancy

,

update anomaly

,

deletion anomaly

4/30/2020 Luke Huan Univ. of Kansas 10

4/30/2020 X a a Z c d

Deal with Non-key FD

X a a Y b b Z c d

X

!

Y

X a Y b X a a  Y b b Z c d Luke Huan Univ. of Kansas 11

Dealing with Nonkey Dependency: BCNF

 A relation

R

is in

B

oyce-

C

odd

N

ormal

F

orm if   For every non-trivial FD

X

!

Y

in

R

,

X

is a super key That is, all FDs follow from “key !

other attributes”  Or, no non-key dependency  When to decompose  As long as some relation is not in BCNF  How to come up with a correct decomposition  Always decompose on a BCNF violation (details next)  Then it is guaranteed to be a lossless join decomposition!

4/30/2020 Luke Huan Univ. of Kansas 12

BCNF decomposition algorithm

 Find a

BCNF violation

 That is, a non-trivial FD

X

!

Y

in

R

where

X

is not a super key of

R

 Decompose

R

into

R

1 and

R

2 , where 

R

1 has attributes

X

[

Y

R

2 has attributes

X

that are in neither

X

[

Z

, where

Z

contains all attributes of

R

nor

Y

(i.e.

Z = attr(R) – X – Y

)  Repeat until all relations are in BCNF 4/30/2020 Luke Huan Univ. of Kansas 13

BCNF decomposition example

EID

!

Ename

,

email EID, PID

!

Hours email

!

EID

WorkOn

(

EID

,

Ename

,

email

,

PID

,

hours

)

BCNF violation:

EID

!

Ename

,

email Student

(

EID

,

Ename

,

email

)

BCNF

Grade

(

EID

,

PID

,

hours

)

BCNF

4/30/2020 Luke Huan Univ. of Kansas 14

Another example

EID

!

Ename

,

email EID, PID

!

Hours email

!

EID

WorkOn

(

EID

,

Ename

,

email

,

PID

,

hours

)

BCNF violation:

email

!

EID StudentID

(

email

,

EID

)

BCNF

StudentGrade’

(

email

,

Ename

,

PID

,

hours

)

BCNF violation:

email

!

Ename

4/30/2020

StudentName

(

email

,

Ename

)

BCNF

Grade

(

email

,

PID

,

hours

) Luke Huan Univ. of Kansas 15

BCNF

Exercise

Property(Property_id#, County_name, Lot#, Area, Price, Tax_rate

)  Property_id# !

County_name, Lot#, Area, Price, Tax_rate

  

County_name, Lot#

! Property_id

#, Area, Price, Tax_rate County_name

!

Tax_rate area

!

Price

16 4/30/2020 Luke Huan Univ. of Kansas

Exercise

Property(Property_id#, County_name, Lot#, Area, Price, Tax_rate) BCNF violation: County_name

!

Tax_rate LOTS1

BCNF

(

County_name, Tax_rate LOTS2

( )

Property_id#, County_name, Lot#, Area, Price

)

BCNF violation: Area

!

Price LOTS2A

BCNF

(

Area, Price

)

LOTS2B

(

Property_id#, County_name, Lot#, Area

)

BCNF

4/30/2020 Luke Huan Univ. of Kansas 17

Why is BCNF decomposition lossless

Given non-trivial

X

!

Y

in

R

where

X

is

not

a super key of

R

, need to prove:  Anything we project always comes back in the join:

R

µ ¼

XY

(

R

)  ¼

XZ

(

R

)  Sure; and it doesn’t depend on the FD  Anything that comes back in the join must be in the original relation:

R

¶ ¼

XY

(

R

)  ¼

XZ

(

R

)  Proof makes use of the fact that

X

!

Y

4/30/2020 Luke Huan Univ. of Kansas 18

Summery

 Non-key functional dependencies: a source of redundancy  BCNF decomposition: a method removes ALL non-key dependency and therefore completely removes any functional-dependency-related redundancies  Plus, BNCF decomposition is a lossless join decomposition 4/30/2020 Luke Huan Univ. of Kansas 19

Database Design

4/30/2020 Luke Huan Univ. of Kansas 20

SQL

 SQL:

Structured Query Language

 Pronounced “S-Q-L” or “sequel”  The standard query language supported by most commercial DBMS  A brief history      IBM System R ANSI SQL89 ANSI SQL92 (SQL2) ANSI SQL99 (SQL3) ANSI SQL 2003 (+OLAP, XML, etc.) 4/30/2020 Luke Huan Univ. of Kansas 21

Creating and dropping tables

   CREATE TABLE

table_name

(

,

column_name i column_type i

, DROP TABLE

table_name

;

); Examples create table Student (SID integer, name varchar(30), email varchar(30), age integer, GPA float); create table Course (CID char(10), title varchar(100)); create table Enroll (SID integer, CID char(10)); drop table Student; drop table Course; drop table Enroll; -- everything from -- to the end of the line is ignored.

-- SQL is insensitive to white space.

-- SQL is insensitive to case (e.g., ...Course... is equivalent to -- ...COURSE...) 4/30/2020 Luke Huan Univ. of Kansas 22

Basic queries: SFW statement

 SELECT

A

1 ,

A

2 , … ,

A n

FROM

R

1 ,

R

2 , … ,

R m

WHERE

condition

;  Also called an SPJ (select-project-join) query  (

almost

) Equivalent to relational algebra query ¼

A

1 ,

A

2 , …,

A n

( ¾

condition

(

R

1 £

R

2 £ … £

R m

)) 4/30/2020 Luke Huan Univ. of Kansas 23

Semantics of SFW

 SELECT

E

1 ,

E

2 , … ,

E n

FROM

R

1 ,

R

2 , … ,

R m

WHERE

condition

;    For each

t

1 For each in

R

1 :

t

2 in

R

2 : … … For each If

t m

in

R m

:

condition

is true over

t

1 ,

t

2 , …,

t m

: Compute and output

E

1 ,

E

2 , …,

E n t

1 ,

t

2 , …,

t m

are often called tuple variables as a row Not 100% correct, we will see 4/30/2020 Luke Huan Univ. of Kansas 24

Example: selection and projection

 Name of students under 18  SELECT name FROM Student WHERE age < 20;  ¼

name

( ¾

age <20

(

Student

)) sid 1234 1123 1011 1204 1306 name John Smith Mary Carter Bob Lee Susan Wong Kevin Kim age 21 19 22 22 18 gpa 3.5

3.8

2.6

3.4

2.9

sid 1234 1123 1011 1204 1306 name John Smith Mary Carter Bob Lee Susan Wong Kevin Kim age 21 19 22 22 18 gpa 3.5

3.8

2.6

3.4

2.9

4/30/2020 Luke Huan Univ. of Kansas 25

Example: Operations

 When was Lisa born?

 SELECT 2006 – age FROM Student WHERE name = ’Lisa’;  SELECT list can contain expressions  Can also use built-in functions such as SUBSTR , ABS , etc.

 String literals (case sensitive) are enclosed in single quotes 4/30/2020 Luke Huan Univ. of Kansas 26

Example: reading a table

 SELECT * FROM Student;    Single-table query, so no cross product here WHERE clause is optional * is a short hand for “all columns” 4/30/2020 Luke Huan Univ. of Kansas 27

Example: join

 SID’s and names of students taking the “Database” courses  SELECT Student.SID, Student.name

FROM Student, Enroll, Course WHERE Student.SID = Enroll.SID

AND Enroll.CID = Course.CID

AND title = ’Database’;   Okay to omit

table_name

in

table_name

.

column_name

if

column_name

is unique A better way to deal with string is to use “LIKE” and % , which matches with any string with length 0 or more  AND title LIKE ’%Database%’;   \ is the escape operator ‘_’ is used for representing any a single character 4/30/2020 Luke Huan Univ. of Kansas 28

ORDER BY

  SELECT ...

FROM … WHERE ORDER BY

output_column

|

DESC

]

,

; ASC = ascending, DESC = descending

[

ASC  Operational semantics  After SELECT list has been computed, sort the output according to ORDER BY specification 4/30/2020 Luke Huan Univ. of Kansas 29

ORDER BY example

 List all students, sort them by GPA (descending) and name (ascending)  SELECT SID, name, age, GPA FROM Student ORDER BY GPA DESC, name;  ASC is the default option  Strictly speaking, only output columns can appear in ORDER BY clause (although some DBMS support more)  Can use sequence numbers instead of names to refer to output columns: ORDER BY 4 DESC, 2; 4/30/2020 Luke Huan Univ. of Kansas 30

Example: rename

 SID’s of all pairs of classmates    Relational algebra query: ¼

e

1.

SID

,

e

2.

SID

( ½

e

1

Enroll

!

e

1.

CID

=

e

2.

CID

Æ

e

1.

SID

>

e

2.

SID

½

e

2

Enroll

) SQL: SELECT e1.SID

AS

FROM Enroll

AS

SID1, e2.SID e1, Enroll

AS AS

e2 WHERE e1.CID = e2.CID

AND e1.SID > e2.SID; SID2

AS

keyword is optional 4/30/2020 Luke Huan Univ. of Kansas 31

A more complicated example

 Titles of all courses that Bart and Lisa are taking together SELECT c.title

FROM Student sb, Student sl, Enroll eb, Enroll el, Course c WHERE sb.name = ’Bart’ AND sl.name = ’Lisa’ AND eb.SID = sb.SID AND el.SID = sl.SID

AND eb.CID = c.CID AND el.CID = c.CID; Tip: Write the FROM clause first, then WHERE , and then SELECT 4/30/2020 Luke Huan Univ. of Kansas 32

Why SFW statements?

 Out of many possible ways of structuring SQL statements, why did the designers choose SELECT FROM WHERE ?

   A large number of queries can be written using only selection, projection, and cross product (or join) Any query that uses only these operators can be written in a canonical form: ¼

L

( ¾

p

(

R

1 £ … £

R m

))  Example: ¼

R

.

A

,

S

.

B

¼

R

.

A

,

S

.

B

,

T

.

C

¾

p

1 Æ (

R p

2 Æ !

p

1

p

3 (

S R

) £ !

S p

£ 2 ( ¼

T

.

C T

) ¾

p

3

T

) = SELECT FROM WHERE captures this canonical form 4/30/2020 Luke Huan Univ. of Kansas 33

Set versus bag semantics

 Set   No duplicates Relational model and algebra use set semantics  Bag    Duplicates allowed Number of duplicates is significant SQL uses bag semantics by default 4/30/2020 Luke Huan Univ. of Kansas 34

Set versus bag example

sid 1234 1234 1124 1123

Enroll

cid 647 108 647 108 grade A B A A ¼

SID Enroll

sid 1234 1124 1123 SELECT SID FROM Enroll; Luke Huan Univ. of Kansas sid 1234 1234 1124 1123 4/30/2020 35

A case for bag semantics

 Efficiency  Saves time of eliminating duplicates  Which one is more useful?

  ¼

GPA Student

SELECT GPA FROM Student;  The first query just returns all possible GPA’s  The second query returns the actual GPA distribution  Besides, SQL provides the option of set semantics with DISTINCT keyword 4/30/2020 Luke Huan Univ. of Kansas 36

Forcing set semantics

 SID’s of all pairs of classmates  SELECT e1.SID AS SID1, e2.SID AS SID2 FROM Enroll AS e1, Enroll AS e2 WHERE e1.CID = e2.CID

AND e1.SID > e2.SID;   Say Bart and Lisa both take CPS116 and CPS114 SELECT DISTINCT SID2 ...

e1.SID AS SID1, e2.SID AS  With DISTINCT , all duplicate ( SID1 , SID2 ) pairs are removed from the output 4/30/2020 Luke Huan Univ. of Kansas 37

Operational semantics of SFW

 SELECT [ DISTINCT ] FROM

R

1 ,

R

2 , … ,

R m E

1 ,

E

2 , … ,

E n

WHERE

condition

;   For each

t

1 For each in

R

1 :

t

2 in

R

2 : … … For each If

t m

in

R m

:

condition

is true over

t

1 ,

t

2 , …,

t m

: Compute and output

E

1 ,

E

2 , …,

E n

If DISTINCT is present as a row Eliminate duplicate rows in output

t

1 ,

t

2 , …,

t m

are often called tuple variables 4/30/2020 Luke Huan Univ. of Kansas 38

Review

 Non-key dependency is a source of redundancy   Using BCNF to remove dependency related redundancy SELECT FROM WHERE statements (select-project-join queries)  Set and bag operations  Next: how to nest SQL queries 4/30/2020 Luke Huan Univ. of Kansas 39