Transcript Document

CSCI485 – File & Database Management Systems
Bahram Zartoshty
Office: SAL 346
Phone: TBA
Office Hours: TTH 1:15-2:50pm
Note: Parts of this lecture were developed by
Professor Ghandeharizadeh
1
Logistics

Required text book:


Database System Concepts, Silberschatz,
Korth & Sudarshan, Fifth edition.
Pre-req for the course:


CS201: Data Structures
Knowledge of an object-oriented
programming language such as C++,
Java, C#
Teaching Assistant
Shahin Shayande
Office: (Microsoft Lab)SAL 200C
Office Hours: TBA
3
Grading



Midterm 1: 35%
Midterm 2: 35%
Project & Assignments: 30%
4
What to do immediately?

Register with the web site
http://dblab.usc.edu/csci485
5
Database Management System (DBMS)

Database: An integrated collection of data, usually
stored on secondary storage, typically describing
the activities of one or more related organizations.

Database management system (DBMS): A collection
of software/programs designed to assist in
maintaining and utilizing large collections of data.

DBMS contains information about a particular
enterprise
Used almost on a daily basis for either individual or
business use.
Relational database vendors were one of the fastest
growing sectors during the .COM boom!


6
BEFORE DBMS
In the early days, database applications were built on top of file systems
User
1
User
2
Application
programs
Application
programs
7
Data
Data
AFTER DBMS
User
1
Application
programs
User
2
Application
programs
DBMS
8
Data
managed
by DBMS
WHY A DBMS?
1.
2.
3.
4.
5.
6.
7.
8.
Reduced application development time
Data independence: Application programs not
dependent on data representation and storage details
Data sharing: data is better utilized (discovered and
reused), redundancy of data is minimized
Data integrity and consistency: one may enforce
consistency constraints on data, e.g., number of seats
sold ≤ number of seats on the plane × 1.1
Centralized control: DBA tunes the database to balance
user's needs
Security: mechanisms to prevent unauthorized access.
These mechanisms are based on content instead of fileoriented approach.
Concurrency control: avoids undesirable race
conditions that arise with simultaneous access/updates
to data
Crash recovery: ensures the integrity of data in the
presence of failures
9
DATABASE MANAGEMENT
SYSTEMS ARCHITECTURE
User
1
DBMS
User
n
DB
Physical
data
Conceptual
schema
10
Data Models






A collection of tools for describing

Data

Data relationships

Data semantics

Data constraints
Relational model
Entity-Relationship data model (mainly for database
design)
Object-based data models (Object-oriented and Objectrelational)
Semistructured data model (XML)
Other older models:

Network model

Hierarchical model
11
Challenges
Conceptual
Logical
Physical
Abstraction,
Inheritance,
Encapsulation
Reduction to tables
with minimal: data
duplication,
potential for data
loss and update
anomalies
Effective use of a DBMS,
management of
mismatch between
tables and OO
constructs, Index
structures, CC & Crash
recovery, Optimization
techniques
Conceptual Data Models

Entity-Relationship (ER) data model

Entities, Attributes, Relationships
SS#
name
address
Emp
Conceptual Data Models

Entity-Relationship (ER) data model

Entities, Attributes, Relationships
Co-Pay
SS#
name
address
Emp
Enrolled
in
Health
Plan
name
Conceptual Data Models

Entity-Relationship (ER) data model


Entities, Attributes, Relationships
Recursive relationships
SS#
name
address
Emp
Married
to
Conceptual Data Models

Entity-Relationship (ER) data model


Entities, Attributes, Relationships
Recursive relationships
SS#
name
address
Emp
Works
for
Conceptual Data Models

Entity-Relationship (ER) data model


Entities, Attributes, Relationships
Recursive relationships
SS#
name
address
Emp
Works
for
date
Conceptual Data Models

Entity-Relationship (ER) data model



Entities, Attributes, Relationships
Recursive relationships
Inheritance
student
sid
name
Generalization
Undergrad
ISA
Specialization
graduate
Conceptual Data Models


Abstraction, Inheritance,
Encapsulation
Exercise these concepts using in-class
examples and homework assignments

A library database contains a listing of authors who have written
books on various subjects (one author per book). It also contains
information about libraries that carry books on various subjects.
Conceptual Data Models


Abstraction, Inheritance,
Encapsulation
Exercise these concepts using in-class
examples and homework assignments



A library database contains a listing of authors who have written
books on various subjects (one author per book). It also contains
information about libraries that carry books on various subjects.
Entity sets: authors, subjects, books, libraries
Relationship sets: wrote, carry, indexed
Conceptual Data Models


Abstraction, Inheritance,
Encapsulation
Exercise these concepts using in-class
examples and homework assignments

A library database contains a listing of authors who have written
books on various subjects (one author per book). It also contains
information about libraries that carry books on various subjects.
title
Subject
matter
isbn
SS#
authors
wrote
books
libraries
carry
name
address
index
subject
Data Models
SS#
name Emp
address
Works
for
Logical
Physical
Relational Data Model

Prevalent in today’s market place.

Why? Performance!

Everything is a table!

Logical data design is the process of
reducing an ER diagram to a collection
of tables.
Logical Data Design

Trivial reduction:



An entity set = a table
A relationship set = a table
Pitfalls:



Duplication of data
Unintentional loss of data
Data ambiguity that impacts software
design, resulting in update anomalies
Data Duplication
SS#
Emp
name
Works
for
address
SS#
Name
Address
SS#
396
Shahram
Seattle
396
400
400
Asoke
Chicago
200
400
120
400
200
Joe
New York
MGR
SS#
Data Duplication
SS#
Emp
name
Works
for
address
SS#
Name
Address
SS#
396
Shahram
Seattle
396
400
400
Asoke
Chicago
200
400
120
400
200

Joe
MGR
SS#
New York
The SS# column is duplicated!
Data Duplication: Solution

Merge the two tables into one:
SS#
Emp
name
Works
for
address
SS#
Name
Address
MGR
SS#
396
Shahram
Seattle
400
400
Asoke
Chicago
NULL
200
Joe
New York
400
Data Loss


Ford maintains warehouses containing
different automobile parts
Part#
Description
Location
123
Piston
Tijuana
203
Cylinder
Michigan
877
Bumper
Michigan
389
Seats
Arizona
Records are inserted and deleted
based on availability of a part at a
warehouse
Data Loss (Cont…)


When a warehouse becomes empty, it
is lost from the database:
Part#
Description
Location
123
Piston
Tijuana
389
Seats
Arizona
Solution: utilize two different tables
Part#
Description
WHID
WHID
Location
123
Piston
12
12
Tijuana
389
Seats
45
45
Arizona
Data Ambiguity


Represent faculty of a department as:
Faculty
Department Location
Ghandeharizadeh
Comp Sci
SAL
Zartoshty
Comp Sci
SAL
Bohem
Comp Sci
SAL
A change of address for a faculty might
be for the entire department. This
cannot be differentiated with this table
design!
Data Ambiguity

Utilize two tables:
Faculty
Department
Department Location
Ghandeharizadeh
Zartoshty
Jenkins
Bohem
Comp Sci
Comp Sci
Bio Medical
Comp Sci
Comp Sci
SAL
Sex Ed
BOVARD
Bio Medical
HEDCO
Data Ambiguity (Cont…)


Employees of a bi-lingual company
having different skills.
Employee
Skill
Language
Asoke
Teach
Hindi
Asoke
Cook
French
Asoke
Null
German
Asoke
Program
English
Update anomalies!
Data Ambiguity: Solution

Utilize two tables:
Employee
Employee
Language
Asoke
Hindi
Asoke
French
Asoke
German
Asoke
English
Skill
Asoke
Teach
Asoke
Cook
Asoke
Program
Logical Data Design

A quest to flatten objects with minimal
data duplication, loss of data, and
update anomalies!

William Kent, “A Simple Guide to Five
Normal Forms in Relational Database
Theory”, Communications of the ACM
26(2), Feb 1983, 120-125.
Data Models
SS#
name Emp
address
Works
for
Logical Data Design
SS#
Name
Address
MGR SS#
396
Shahram
Seattle
400
400
Asoke
Chicago
Null
Physical
Physical Implementation

Reconstruct main memory objects for
manipulation and presentation:

Specify class definitions




Typically correspond to entity-sets
Populate an instance of a class by issuing
SQL queries to a DBMS
Update instances in memory
Flush dirty instances back to DBMS

Potential use of transactions
Type Mismatch

A column of a row must be a primitive
such as an integer, real, etc.



It may NOT be an array of integers or
object pointers
A property (attribute) of a class might
be of a multi-valued type, e.g., an array,
a vector, etc.
Changes in software may impact the
design of tables. (Management of type
mismatch by the system designer.)
Implementation

Set operators in the DBMS



Does set A contain set B?
Does value v1 appear in set A?
Aggregates in the DBMS



Compute average employee salary
Count the number of employees
Find the oldest employee
Challenges
Conceptual
Logical
Physical
Abstraction,
Inheritance,
Encapsulation
Reduction to tables
with minimal: data
duplication,
potential for data
loss and update
anomalies
Effective use of a DBMS,
management of
mismatch between
tables and OO
constructs, Index
structures, CC & Crash
recovery, Optimization
techniques
Entity-Relationship Model
Example of schema in the entity-relationship model
40
Entity Relationship Model (Cont.)


E-R model of real world
 Entities (objects)
 E.g. customers, accounts, bank branch
 Relationships between entities
 E.g. Account A-101 is held by customer
Johnson
 Relationship set depositor associates
customers with accounts
Widely used for database design
 Database design in E-R model usually converted
to design in the relational model (coming up
next) which is used for storage and processing
41
Relational Model

Example of tabular data in the
relational model
42
Attributes
A Sample Relational Database
43
DATA INDEPENDENCE
1.
2.
Physical data independence: modify the
physical scheme (data structures, e.g., Btree or hash index) without causing
application programs to be rewritten.
These modifications are necessary to
enhance performance and new software
releases. Most relational vendors
support this kind of data independence.
Logical data independence: Modify the
conceptual scheme (e.g., add a new
attribute to a table, rename an attribute)
without causing application programs to
be rewritten. This kind of data
independence is harder to achieve.
44
DATABASE LANGUAGES
There are several languages associated with a database:
Data Definition Language (DDL): The database scheme is
specified by a set of definitions that are expressed by a
special language named DDL. The result of compiling
DDL statements is a set of tables stored in a file called
data dictionary. This file contains meta-data (data about
the data stored in the database).
Data Manipulation Language (DML): a language that
enables users to access or manipulate data (retrieve,
insert, replace, delete) as organized by a certain data
model. We will look at a commercial DML named SQL. In
general, there are two types of DML:
1.
2.
•
•
Procedural: Describes what data is needed and how to get it:
e.g., relational algebra
Non-procedural: Describes what data is needed without
specifying how to get it: e.g., tuple relational calculus
45
SQL

SQL: widely used non-procedural
language



E.g. find the name of the customer with customer-id 192-83-7465
select customer.customer-name
from customer
where customer.customer-id = ‘192-83-7465’
E.g. find the balances of all accounts held by the customer with
customer-id 192-83-7465
select account.balance
from depositor, account
where depositor.customer-id = ‘192-83-7465’ and
depositor.account-number =
account.account-number
Application programs generally access
databases through one of


Language extensions to allow embedded SQL
Application program interface (e.g. ODBC/JDBC) which allow SQL
queries to be sent to a database
46
Storage Management



Storage manager is a program module that provides
the interface between the low-level data stored in
the database and the application programs and
queries submitted to the system.
The storage manager is responsible to the following
tasks:
 Interaction with the file manager
 Efficient storing, retrieving and updating of data
Issues:
 Storage access
 File organization
 Indexing and hashing
47
Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
48
Query Processing (Cont.)



Alternative ways of evaluating a given query
 Equivalent expressions
 Different algorithms for each operation
Cost difference between a good and a bad way of
evaluating a query can be enormous
Need to estimate the cost of operations
 Depends critically on statistical information
about relations which the database must
maintain
 Need to estimate statistics for intermediate
results to compute cost of complex
expressions
49
Transaction Management



A transaction is a collection of operations that
performs a single logical function in a database
application
Transaction-management component ensures
that the database remains in a consistent
(correct) state despite system failures (e.g.,
power failures and operating system crashes)
and transaction failures.
Concurrency-control manager controls the
interaction among the concurrent transactions,
to ensure the consistency of the database.
50
SYSTEM USERS




There are several kind of users associated
with a system:
Database administrator: defines schemas,
storage structures and access method
definitions, physical organization,
authorization, integrity constraints.
Application programmers: they write a
program and make it available to the endusers
Sophisticated users: they use a query
language (SQL) to access the database
interactively
Naive (end) users: they invoke the application
programs
51
Overall System Structure
52