Databases 2003 - Carnegie Mellon University

Download Report

Transcript Databases 2003 - Carnegie Mellon University

eCommerce Technology
20-751
Databases
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Concepts
• Relational model
• SQL
• DB construction
– Normalization
– ER diagrams
• Transactions
• Web support
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Critical Role of Data
• Without data, an organization cannot function
– especially in eCommerce
• Initially, data was prepared for specific applications
– payroll data for the payroll system
– parts lists for the bill of materials system
– sales data for statistical analysis
• By 1970, clear that data had common properties
• Data for many applications could be stored together
in an organized way
– database instead of separate collections
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
What is a Database?
• No formal definition
• A collection of related data allowing:
– insert (add new data)
– delete (delete existing data)
– update (change existing data = delete + insert)
– query (retrieve all data having a certain property)
• What does “related” mean?
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Database Management System
• Based on a data model, e.g. relational, object,
hierarchical
• Has data definition language (DDL) to identify data
• Has data manipulation language (DML) for queries
and updates
• Separates structure of data from DB implementation
• Enforces data structure and content rules
• Handles transactions, concurrent operations
• Allows backup and recovery from errors
• Connects to other software
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
The Relational Model
• A set is a collection of unique items
{ CS, HCII, ISRI, RI, LTI, CALD } Divisions of SCS
{ CS, HCII, CS, HCII, RI, CS } NOT A SET (repeated elements)
• A relation on two sets A, B is a set of pairs of
elements, one from A and one from B
A = { 46-870, 20-751, 46-749, 20-753, 20-770 }
B = { GSIA, SCS }
R = { (46-870, GSIA), (20-751, SCS), (20-753, SCS),
(20-770, GSIA), (46-749, GSIA) }
• Relations can be defined on any number of sets
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
The Relational Model of Data
A = { 46-870, 20-751, 46-749, 20-753, 20-770 }
B = { GSIA, SCS }
R = { (46-870, GSIA), (20-751, SCS), (20-753, SCS), (20-770,
GSIA), (46-749, GSIA) }
20-770
46-870
46-749
20-753
20-751
GSIA
SCS
This is the graph of the relation R
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
The Relational Model of Data
A = { 46-870, 20-751, 46-749, 20-753, 20-770 }
B = { GSIA, SCS }
R = { (46-870, GSIA), (20-751, SCS), (20-753, SCS), (20-770,
GSIA), (46-749, GSIA) }
CONTAINS ONLY
COURSE NUMBERS
COURSE
SCHOOL
20-770
SCS
46-749
GSIA
20-753
SCS
20-751
SCS
46-870
GSIA
CONTAINS ONLY
SCHOOL NAMES
This is a table of the relation R
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
The Relational Model of Data
Relations are not necessarily binary. May involve many sets:
COURSE SCHOOL REQ'D ROOM #
FACULTY
DEPT
20-770
SCS
Y
152
64
STEENKISTE
CS
46-749
GSIA
N
150
62
GOETTLER
GSIA
20-753
SCS
N
150
57
NYBERG
LTI
20-751
SCS
Y
152
64
SHAMOS
LTI
46-870
GSIA
Y
152
64
MUKHOPADHYAY GSIA
• Each row is a 7-tuple. Relation on 7 sets.
• No implied ordering of either rows or columns. Sorting is irrelevant
• Note: bad table design since “DEPT” is an attribute of “FACULTY”,
not “COURSE”
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Tables
•
•
•
•
A relation can be represented as a table
One row for each tuple in the relation
Easier to draw than a graph
Table has implicit order (of rows and columns)
– But: a relation has no ordering, either of tuples or
attributes
• The cardinality C(R) of a relation R is the number of
tuples it contains = # of rows in its table
• Relational model represents data as a collection of
unordered two-dimensional tables
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Keys
• Key: an attribute (or minimum set of attributes) that
uniquely defines a tuple
– In the example relation, “Course” is a key
• A relation may have more than one key.
• A set of attributes that can serve as a key is a
candidate key.
• One is chosen as the primary key.
• Keys are used to reference (retrieve) tuples.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Foreign Keys
• A key from one relation that is an attribute of another
relation is a foreign key.
• If we had a “Faculty” relation, then “Faculty” would be
a foreign key in the “Courses” relation.
• Foreign keys connect relations together.
FOREIGN KEY:
PRIMARY KEY
PRIMARY KEY
COURSE SCHOOL REQD RM #
FACULTY
20-770
SCS
Y
152 64 STEENKISTE
FACULTY
GOETTLER
46-749
GSIA
N
150 62 GOETTLER
MUKHOPADHYAY GSIA
20-753
SCS
N
150 57 NYBERG
NYBERG
LTI
20-751
SCS
Y
152 64 SHAMOS
SHAMOS
LTI
46-870
GSIA
Y
152 64 MUKHOPADHYAY
STEENKISTE
CS
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
DEPT
GSIA
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Operations on Relations
• Projection
List specific attributes L (columns) of R, written L(R)
E.g. show course number and room
Course, Room(Courses)
Courses
COURSE SCHOOL REQD RM
20-770
SCS
Y
152
# FACULTY
64 STEENKISTE
DEPT
CS
COURSE RM
20-770
152
46-749
GSIA
N
150
62 GOETTLER
GSIA
46-749
150
20-753
SCS
N
150
57 NYBERG
LTI
20-753
150
20-751
SCS
Y
152
64 SHAMOS
LTI
20-751
152
46-870
GSIA
Y
152
64 MUKHOPADHYAY GSIA
46-870
152
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Operations on Relations
• Selection (extract horizontal slices)
– List all tuples of relation R whose attributes satisfy
condition C, written C(R)
– E.g. show all tuples with Room = 152, Room=152(R)
COURSE SCHOOL REQD RM
20-770
SCS
Y
152
# FACULTY
64 STEENKISTE
DEPT
CS
46-749
GSIA
N
150
64 GOETTLER
GSIA
20-751
SCS
Y
152
64 SHAMOS
LTI
• Projection & Selection are unary (1-table) operations
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Structured Query Language (SQL)
• A data manipulation language for manipulating
relational databases
• SELECT queries the database
• UPDATE modifies relations
• DELETE removes tuples
Syntax of the SQL SELECT command:
SELECT { attributes }
FROM { table }
WHERE { attribute-conditions };
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Structured Query Language (SQL)
YIELDS DISTINCT TUPLES
SINCE CourseNo IS A KEY
• Projection
– SQL: SELECT CourseNo, Room FROM Courses;
– SQL: SELECT DISTINCT Room FROM Courses;
MUST ASK FOR DISTINCT TUPLES
SINCE Room IS NOT A KEY
• Selection
– SELECT * FROM Courses WHERE Room= “152”;
– Give a table of all courses that meet in 152
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Join
• The natural join A * B consists of tuples with
matching attributes (names & values) in A and B
• Natural join is a way of obtaining information across
tables
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Natural Join A * B
• Attribute names and values must match
Statistics:
Geography:
Detroit MI
=
1027974
*
City
State Pop
% For
Seattle WA
532900 13.1
3.4
City
Seattle
Atlanta
Detroit
Area
84
130
139
Elevation Radio
14
40
1050
27
601
59
City
Seattle
State Pop
% For
WA
532900 13.1
Area
84
Elevation
14
Radio
40
Detroit
MI
139
601
59
1027974
3.4
• Also called “inner join”: Statistics * Geography
• Cartesian product and join are binary operations
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Joins in SQL
• SELECT City, State, Radio
FROM Statistics
INNER JOIN Geography
ON Statistics.City = Geography.City
Statistics:
Geography:
City
Seattle
Atlanta
Detroit
City
State Pop
% For
Seattle WA
532900 13.1
Detroit MI
Result of Query:
1027974
3.4
City
Seattle
State
WA
Detroit
MI
20-751 ECOMMERCE TECHNOLOGY
Area
84
130
139
Elevation Radio
14
40
1050
27
601
59
Radio
40
59
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Other SQL Constructs
• ORDERBY (sorting)
– SELECT Company, OrderNumber FROM Orders ORDER
BY Company;
• BETWEEN
– SELECT * FROM Persons WHERE LastName BETWEEN
'Hansen' AND 'Pettersen‘;
• ALTER TABLE (changes table structure)
• Functions
– SUM()
– COUNT()
– MAX()
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Database Constraints
• Domain (data validity) constraints
– All values in a column must be from the same
domain
– Example: all salaries are positive numeric dollar
amounts. “Monthly” is invalid.
• Entity Integrity
– Every entity must have a unique primary key.
(Otherwise, can’t access the entity)
• Referential Integrity
– Every foreign key value in a relation must match a
primary key in the foreign relation table
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Functional Dependency
• Attribute B is functionally dependent on attribute
A if the value of A uniquely determines B
– One-to-one relationship: two functional
dependencies: A depends on B; B depends on A
– Many-to-one relationship: one functional
dependency: B depends on A
– Many-to-many relationship: no dependencies:
neither A nor B depends on the other
• Functional dependencies are constraints between
attributes or sets of attributes. They must be
maintained or error or inconsistency will result.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Normalization
• A relation is well-structured if it is non-redundant and allows
INSERT, MODIFY and DELETE without error or inconsistency.
• Normalization assists in maintaining functional dependencies
and preventing errors and inconsistencies.
• DELETE anomaly:
Student Email
Course
Smith
smithj@andrew 20-751
Room
152
Smith
Jones
152
150
smithj@andrew 20-753
jonesj@cs
46-870
• Deleting “Jones” removes all information about course 46-870
(namely that its room is 150)
• In the information is in another table, it shouldn’t be here also.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Normalization
• MODIFY anomaly:
Student Email
Course
Smith
smithj@andrew 20-751
Room
152
Smith
Jones
152
150
smithj@andrew 20-753
jonesj@cs
46-870
• Suppose Smith’s email address changes. Every line in the table
corresponding to Smith must be changed or data will be
inconsistent.
• An attribute unique to a key should be entered only once in the
database.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Normalization
• Restructuring to produce smaller, well-structured equivalent
relations, reduce data replication
• First Normal Form. Make all attributes atomic. No multiple
values.
Name
Phone
Carbonell 83064, 87279
Dept
CS, LTI
Reddy
CS, Robotics WeH
82597, 87170
Bldg
WeH, NSH
MULTIPLE
VALUES
FIRST
NORMAL
FORM:
MULTIPLE
VALUES
Name
Phone
Carbonell 83064
Dept
CS
Bldg
WeH
Carbonell 87279
Reddy
82597
Reddy
87170
LTI
CS
Robotics
NSH
WeH
WeH
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Second Normal Form
• Eliminate partial functional dependencies. Every non-key
attribute must depend on all key attributes (or redundancy can
result).
Capital
KEY IS
(City, State)
NOT IN 2NF:
2NF:
DECOMPOSE
INTO TWO
TABLES
City
Philadelphia
State Capital
PA
Harrisburg
Pittsburgh
Detroit
PA
MI
City
Philadelphia
State
PA
Pittsburgh
Detroit
PA
MI
Harrisburg
Ann Arbor
City Pop.
1478002
1336449
1027974
City Pop.
1478002
DEPENDS ON
State ONLY,
NOT CITY
1336449
1027974
State Capital
PA
Harrisburg
MI
Ann Arbor
There are many other normal forms and normalization rules
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Entity-Relationship (ER) Diagrams
• Must specify:
– Entities (things to be represented in the database)
– Attributes (properties of entities)
– Relationships (relations among entities)
• These can be modeled by entity-relationship
diagrams
• The diagrams are used as a guide to designing the
database
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Entity-Relationship (ER) Diagrams
• Entity types
– Entity type: Store
– Entities: Downtown Store, Squirrel Hill Store, Oakland Store
• Relationships between entity types:
• This is the “Has” relationship
• Direction of arrow is important (“Branch has Staff,”
not “Staff Has Branch”)
EXAMPLE FROM CONNOLLY & BEGG
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Entity-Relationship (ER) Diagrams
• Relationships need not be binary:
• This is the “Arranges” relationship; it can be though of
as a 4-tuple (Solicitor, Bid, Buyer, Institution)
EXAMPLE FROM CONNOLLY & BEGG
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Entity-Relationship (ER) Diagrams
EXAMPLE FROM CONNOLLY & BEGG
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Entity-Relationship (ER) Diagrams
EXAMPLE FROM CONNOLLY & BEGG
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Web Database Connectivity
JDBC = Java Database
Connectivity
SQLJ = Java-Embedded
SQL
SOURCE: CONNOLLY & BEGG
Distributed Databases
• Databases in which data is stored in more than one
location but appears local to the user
– Replicated: multiple copies of database
– Partitioned: data is split among locations
• Fragmentation
– Information about fragments is stored in a
distributed data catalog (DDC)
– Horizontal v. vertical fragmentation
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Distributed Databases
• Advantages
– Reduced load on central DB
– Lower cost (data spread among small machines)
– Reliability (machine failure is not fatal)
– Fast access to local data
– Ease of growth
• Disadvantages
– Complexity. Difficult to maintain consistency
– Security (many access points)
– Telecommunications required
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Distributed Databases
• Products
– PeerDirect
• Issues
– Updating of information in a distributed database
is a form of transaction processing
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
What is a Transaction?
• An action requiring a series of steps and database
updates.
• Transactions are the basis of Ecommerce.
• Transactions may be distributed. Steps processed
on different computers.
• Transactions may fail. One or more steps may be
unsuccessful.
• Transaction systems must be recoverable. Data and
“state” must be restored after failure.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
ATM Transaction Database
ACCT
536-0178
536-9112
PIN
AEZ22
71BZZ
BALANCE
$ 1013.22
$ 25561.18
557-0308
82A6Z
$
599-8133
632-0012
K8LL0
R3TTP
$ 1622.77
$ 12016.45
ACCOUNT MASTER
STOLEN CARDS
3278.08
RECENT ACTIVITY
POST I NG LOG
ACCT
557-0308
536-9112
RECENT
$ 150.00
$ 6700.00
542-0061
$
240.00
ACCT
536-0178
536-0178
599-8133
610-0518
$ 5500.00
$
50.00
107-0003
599-8133
20-751 ECOMMERCE TECHNOLOGY
STOLEN
421-2254
536-9112
542-1613
599-0028
613-4299
667-0033
ATM
TIME
UID
AMOUNT
543 10:08:32 0005148
-200.00
543 10:09:21 0005154
-200.00
391 10:10:06 0005167
422 10:11:15 0005174
SUMMER 2003
300.00
-75.00
COPYRIGHT © 2003 MICHAEL I. SHAMOS
ATM Withdrawal
1
2
3
4
5
6
7
8
9
10
11
Check STOLEN
Check PIN
Check RECENT v. BALANCE
Check ATM reserve
Update RECENT
Update BALANCE
Write to Log
Update ATM reserve
Tell ATM to dispense money
Check dispensing status
Make updates permanent
20-751 ECOMMERCE TECHNOLOGY
If card is stolen, ABORT
If wrong, retry 3 times, ABORT
Too much activity, ABORT
If not enough money, ABORT
Indicate new activity
Debit bank account
Record transaction
Debit ATM balance
Pay the man
If failed, ABORT
COMMIT the transaction
(Think: “to memory”)
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
ACID
• Four minimum requirements of a transaction
T in a transaction system:
• Atomic. T executes completely or not at all.
• Consistent. T preserves database
consistency and integrity.
• Isolated. T executes as if it were running
alone. Not affected by other concurrent
transactions.
• Durable. T’s results preserved during failure.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Atomicity
•
•
•
•
Transaction: John pays Mary $100.
Take $100 out of John’s account.
Add $100 to Mary’s account
Problems:
– John or Mary might not have an account
– John might not have $100
– System might fail after subtracting $100 from John
• If failure occurs, must undo partial results
• “Commit”: successful recording of a transaction
• “Abort”: failure of a transaction.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Consistency
• Maintain database constraints
–
–
–
–
data validity
unique primary keys
referential integrity
conservation conditions (debits = credits, total
cash = sub of cash in all accounts, etc.)
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Isolation
• Two transactions T1, T2 are interleaved if some steps
of one are performed after the other starts but before
it completes.
T1 has steps A B C D E F; T2 has steps P Q R S
A B P C D Q R S E F is an interleaved schedule.
P Q R S A B C D E F is not interleaved.
• A sequence of transactions is isolated if their steps
can be interleaved without affecting the result.
Transactions are blind to simultaneous execution.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Durability
• Results must survive failure.
• Logging. Maintaining a record of all data updates so
databases can be repaired if failure occurs.
• Updates must be logged before they are performed.
If failure occurs, transaction can complete from failure
point.
• If abort is necessary, can undo logged transactions.
• Without logging, can’t recover from some types of
failures.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Simultaneous Transactions
• If all TP were done by one single-threaded process, it
would be easy. Just execute one step at a time.
• With just two threads (or processes) it’s complicated.
2 transactions T1, T2: READ A; A = A+1; WRITE A;
A in DB
5
5
T1 STEP READ A
A in T1
5
T2 STEP
A in T2
5
5
6
A=A+1
5
6
READ A
5
5
6
WRITE A
6
6
A=A+1
WRITE A
6
6
6
• Value of A is 6, but it should be 7!
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Locking
• A solution is LOCKING. Associate a variable with
one process at a time. LOCK the others out.
• LOCK A; READ A; A = A+1; WRITE A; UNLOCK A;
• If T1 starts first, it locks A.
• When T2 tries to lock A, it can’t. It has to wait.
• T1 finishes completely before T2 can lock A.
• After T1 finishes, A = 6
• After T2 finishes, A = 7, the correct value
• Locking achieves atomicity
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Deadlock
T1: LOCK A; LOCK B; B=A+B; UNLOCK A; UNLOCK B;
T2: LOCK B; LOCK A; B=A - B; UNLOCK B; UNLOCK A;
A in DB
B in DB
17
3
17
3
T1 STEP LOCK A
T2 STEP
17
3
17
3
17
3
LOCK B
CAN'T EXECUTE
LOCK B
LOCK A
CAN'T EXECUTE
This is deadlock. Neither transaction can complete.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Ways to Eliminate Deadlock
1. Require each transaction to request all locks at the
same time. System either grants them all or none.
• Problem: very restrictive. Transactions cannot be
interleaved. Essentially serial execution.
2. Assign an ordering to the variables: A=1; B=2; C=3 …
Require transactions to request locks in that order.
3. Do nothing. Periodically check for deadlock. If it
exists, cancel out a transaction.
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS
Q&A
20-751 ECOMMERCE TECHNOLOGY
SUMMER 2003
COPYRIGHT © 2003 MICHAEL I. SHAMOS