Introduction to Databases, CS 3/586 Fall 2009 Instructor: Len Shapiro, len at pdx.edu Syllabus: www.cs.pdx.edu/~len/386 This work, and all other files in this series,

Download Report

Transcript Introduction to Databases, CS 3/586 Fall 2009 Instructor: Len Shapiro, len at pdx.edu Syllabus: www.cs.pdx.edu/~len/386 This work, and all other files in this series,

Introduction to Databases, CS 3/586
Fall 2009
Instructor: Len Shapiro, len at pdx.edu
Syllabus: www.cs.pdx.edu/~len/386
This work, and all other files in this series, are licensed under the Creative Commons
Attribution License. To view a copy of this license, visit http://creativecommons.org/licenses
or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco,
California, 94105, USA.
These slides have benefitted from the content of slides developed by Raghu Ramakrishnan
and Johannes Gehrke, Lois Delcambre and David Maier, all of whom I thank.
These slides are best viewed with “slide show”. To print the slides, consider using
"File/Print/Properties/Pages per sheet" instead of Power Point's "handouts" mode.
Slides will be ready by 6PM each Monday evening.
11/7/2015
1
Table of Contents
• Introduction
– Translating Data into Information
– Three Grand Challenges
• Data is widespread
• Data is vast
• Structured vs Unstructured Data
• ER Diagrams
– Requirements Analysis
• Use Cases
–
–
–
–
–
–
Entities, Attributes, Relationships
Instance, legal instance
Keys, Primary and Candidate
Attributes of Relationship Sets
Ternary Relationship Sets
Enties vs Attributes
11/7/2015
• Relational Data Model
–
–
–
–
–
Relation, row, attributes, instance
Cardinality, Degree, Domain
Keys, Primary and Candidate
Schema, Legal Instance
Finding Candidate & Foreign Keys
• ER Diag.  Schema of Tables
• Database, DBMS
–
–
–
–
–
Definitions
Languages
Postgresql
FEC data
SQL
•
•
•
•
•
History
NULL values
DDL, DML
SELECT/FROM/WHERE
DISTINCT,ORDER BY,LIKE
2
Prerequisites
• The prerequisites for this course are CS 161 and 250.
• This week's class will expect you to know the definitions
of these terms from CS250: sets, bags, tuples, relations.
• Please review these concepts from the text that you
used for CS250 or its equivalent.
• The text that we use at PSU is [James Hein, Discrete
Structures, Logic and Compatibility, 3rd Edition, Jones
and Bartlett, 2009].
11/7/2015
3
Learning Objectives*
LO1.1 Write an ER diagram from an RA using
UML, including primary keys, cardinality
constraints, attributes of relationship sets, and
ternary relationship sets
LO1.2 Given a schema, find all foreign keys for
legal instances of the schema.
LO1.3 Translate an ER diagram into a schema of
tables, preserving all information.
*An asterisk in the title means that there is info on the notes page (View/Notes page).
This info is either what I say in class, or answers to questions posed on the slides. I
recommend that you NOT print the notes page in the copy of the slides that you bring to
class, but look at the notes page if you miss class or if your notes are incomplete.
11/7/2015
4
Information
• Digital Information is precious
• Modern business, culture and society could not exist in
its present form without digital information.
• Think about how often you use digital information in your
life.
• If most modern businesses lost their current and backup
information they would be candidates for bankruptcy.
• But nature gives us data, not information.
• Data: disorganized, slow to access,
understand and visualize
11/7/2015
5
Data vs. Information
• Data has its place. It’s not useless.
Data
Information
Atlas
Encyclopedia
Brick and Mortar Store
Printout
Yahoo Maps, Mapquest
Wikipedia, Google Search
Online Store, e.g., Amazon
Data Access Form
• In other CS courses, you learn how to manipulate (inmemory) data, using programs.
• In CS386/586 you will learn to transform data into
information and manage that information.
11/7/2015
6
Three Challenges of Data*
1. Data is widespread – accessible by the Internet
– Covered in CS4/594, Internetworking Protocols
2. Data is vast - Much too large to fit in memory
•
•
•
•
•
Library of congress: 20 terabytes (see notes view for units)
Photos uploaded to Facebook each month: 20 terabytes
Amazon.com: 42 terabytes
Choicepoint: 250 terabytes
Data processed by Google’s servers each hour: 1 Petabyte
– In this class you’ll learn how to
•
•
•
11/7/2015
Transform data into information
Manage disk-based information
Manage information in parallel
7
Database Challenges 3: Data Structure*
•
Some data, e.g., business data, has clearly defined attributes. In
this class you’ll learn how to transform such structured data into
information by the Databasics Anonymous 7-step method**:
1.
2.
3.
4.
5.
Organizing structured data into an ER Diagram, Ch. 2
Transforming an ER diagram into a Schema of Tables, Ch. 3
Eliminating anomalies from those tables (normalization), Ch. 21
Structuring those tables efficiently (physical design), Ch. 22
Managing those tables using the intergalactic standard language
(SQL), Ch. 5
•
The DBMS manages the tables internally using Relational Algebra, Ch. 4
6. Protecting those tables during concurrent use (ACID properties), Ch.
16
7. Accessing those tables through a web-based interface (some scripting
language)
**I chose this name out of respect for Alchoholics Anonymous, which has saved innumerable lives
11/7/2015
8
Data Challenges 3: Structure (ctd)
• Some data’s attributes are not clearly defined, e.g.,
documents, web pages
– The attributes are words in the data or tags
– In this class you’ll learn how to transform unstructured data into
information by:
• Building data structures to make retrieval efficient (inverted indexes)
• Ranking retrieval results (Google’s page rank algorithm)
• Measuring the effectiveness of querying unstructured data (recall)
11/7/2015
9
A Brief Database History*
1960s: Network and hierarchical models of structured data
1970s: E.F. Codd's relational model, implemented in Ingres and
System R
1976: P. Chen's ER model
1980s: Relational systems/SQL dominate, PC DBMSs become
popular
1990s: Client-server architecture, object-oriented model, Excel,
WWW, Google
11/7/2015
10
Overview
•
This ends our introduction
•
First we will study structured data, for which the
intergalactic standard model is the relational model.
•
Recall the Databasics Anonymous 7-step method**:
1.
2.
3.
4.
5.
Organizing structured data into an ER Diagram, Ch. 2
Transforming an ER diagram into a Schema of Tables, Ch. 3
Eliminating anomalies from those tables (normalization), Ch. 21
Structuring those tables efficiently (physical design), Ch. 22
Managing those tables using the intergalactic standard language
(SQL), Ch. 5
1.
6.
7.
•
The DBMS manages the tables internally using Relational Algebra, Ch. 4
Protecting those tables during concurrent use (ACID properties), Ch.
16
Accessing those tables through a web-based interface (some scripting
language)
We will perform steps 1, 2 and part of 5 this week.
11/7/2015
11
Election Data
• For examples in this course we’ll be using data, collected by the
Federal Elections Commission, about contributions to campaign
committees, from the 2008 election*.
• General info about the FEC is at
http://www.fec.gov/ans/answers_general.shtml .
• Forms-based interfaces to the data are at
http://www.fec.gov/finance/disclosure/disclosure_data_search.shtml
• Raw data in tabular form is at
http://www.fec.gov/finance/disclosure/ftpdet.shtml
• Our, and all database texts, use simple data examples. We hope
that by using real data you’ll not only gain more valuable experience
but be able to ask interesting questions of the data.
• Perhaps you’ll make some newsworthy discoveries.
*This idea was suggested by Vassilis Papadimos, whom we thank.
11/7/2015
12
Raw Data*
• When you first encounter data, it may be in a raw,
disorganized form. In front of you, and below, is some
raw FEC data. Note
– Some data (at the top) is descriptive, some (at the bottom) is in
tables.
– Some data items are self contained
• But note “at the same address as his principal committee”
– Some notation needs human interpretation
• ‘names all start with “McCain Victory” and end with one of…’
Raw Data Document in  http://www.cs.pdx.edu/~len/386/fec/Mats.doc
11/7/2015
13
Real World Data
• Real world data often starts out like the raw data in front
of you, except that there is often much more data and
many more attributes.
• The first step in transforming this raw data into
information is to draw a diagram of it, with the hope that
your customer can understand the diagram and thus the
structure of the data.
• One way to make the diagram understandable is to use
few concepts in the diagram: primarily entity and
relationship.
11/7/2015
14
Entity and Relationship
• An entity is a real-world object distinguishable
from other objects.
– An entity is described using a set of attributes.
• A relationship is an association among 2 or more
entities.
• What are some possible entities, relationships
and attributes in the raw FEC data?
11/7/2015
15
ER Diagram of FEC Data*
Candidate
CandName
Party
Address
Principal
Associated
Committee
CommName
Address
Entity Set
Attribute
Donated to
Relationship Set
Donor
Occupation
11/7/2015
16
Entity, Relationship Sets*
• There is a subtle distinction here:
– Gordon Smith, a candidate, is an entity.
– “Candidate” is the name of the set of all candidates and is called
an entity set.
– Similarly, the pair (Jeff Merkley, Jeff Merkley for Oregon) is a
relationship, but “Principal” is a relationship set.
• What are some examples of entities, entity sets,
relationships ,and relationship sets?
• I will sometimes be sloppy and refer to Candidate,
Committee, etc as entities, but that is not precise.
• There is no such distinction for attributes
11/7/2015
17
Requirements Analysis
• The standard wisdom (from our text and CS300) is to
transform the raw data, as in the handout, into a
Requirements Analysis, or RA, as on the next page, then
to transform the RA into an ER diagram.
• From now on we’ll give you an RA and expect you to
transform it into an ER diagram.
11/7/2015
18
Requirements Analysis
• Store information about employees, departments and
projects
– Each employee has a ssn and name
– Each department has a code and a name
– Each project has a number, name, due date and budget
– Employees are assigned to projects
– Departments sponsor projects
– Each department has a manager and each employee has a
home department
11/7/2015
19
Guidelines
• If something has an attribute, it is an entity set
• Use language from the RA in the ER diagram
• Relationship set names should be part of a sentence
including the names of the entity sets.
• Some parts of the RA may not belong in the ER
diagram.
• Keep the ER diagram as simple as possible.
– No redundancy
• Check that every item in the RA is in the ER diagram if
it belongs there, and nothing else.
11/7/2015
20
Your Emp-Dept ER Diagram*
11/7/2015
21
Instance, Legal Instance*
• An instance of an entity set is the current contents of the
entity set.
• A legal instance is a an instance that could occur in the
real world.
– What is an example of an illegal instance?
• Normally you need a domain expert to determine what is
a legal instance.
• For example, can two departments have the same
name? The same code? The same name and the same
code both? It depends on company policies – only a
domain expert will know.
11/7/2015
22
Keys*
• A key is a minimal set of attributes that uniquely defines an entity in an
entity set, for all legal instances of the entity set.
– Is name a key for Employee? (Number,Name) a key for Project?
– You don't know unless you ask a domain expert.
• If there is more than one key in an entity set, one is chosen as the
primary key; the others are called candidate keys.
– The primary key is usually the one by which the data items are most often
accessed
• In ER diagrams, the primary key is underlined.
• PRACTICE: Underline primary keys in the Emp-Dept ER diagram.
– Len is the domain expert.
• Note that some ER diagrams do not have keys, e.g., the FEC data. It
is common, but not required, during the transformation from data to
information, to add keys at some point.
11/7/2015
23
Requirements Analysis:
Professors
• Professors have a SSN, a name and an age and their SSNs
uniquely identify them.
• There is a bulletin. In the bulletin, each course is listed with
the professor who supervises the course and the professors
who teach the course.
• Courses are uniquely identified by their Quarter and CRN,
and they have a name.
• Sample data: CS386 taught by Len and Lois, supervised by
Dave
• PRACTICE: Draw an ER Diagram, including primary keys
– INCLUDE ONLY INFO IN THE RA!
11/7/2015
24
Your Prof ER Diagram*
11/7/2015
25
Sidebar: Use cases
• In Homework 7 you will need to define 3-5 use cases for
your application to illustrate how your users will interact
with your application.
• I'll give you an example so you can begin thinking about
that assignment.
• A use case describes how an actor interacts with the
system you have built.
• Here's an example use case for the Prof-Course ER
diagram we just produced, assuming it has been
implemented as a system.
11/7/2015
26
Example Use Case:
Find Course Supervisor
Actor: Student who wants to find course supervisor
1. Student logs into Banner
a. If login is invalid, issue error message and return to step 1
2. Student chooses "find course supervisor" from menu.
3. Student inserts quarter and CRN from dropdown list.
4. System returns course supervisor information.
5. Student chooses "exit" from menu.
11/7/2015
27
Notes about Use Cases
• Each use case must have a name and an actor. The
actor may be another system, for example if another
company is purchasing an item.
• Each step of the use case is doable by the actor or the
system you have built.
• Error cases can be handled inline, as in the example, or
at the end of the use case.
• In order to find uses cases, think of who users are, what
they will do, and/or what screens will look like.
• Use the given format: start with the actor and number the
steps.
11/7/2015
28
Cardinality Constraints on Relship sets:
How many entities can participate?
Candidate
i..j
Associated
m..n
CandName
Party
Address
Committee
CommName
Address
• m..n means each candidate is the associated candidate for at least
m and at most n different committees. i..j means that each
committee has at least i and at most j associated candidates.
– What are reasonable values for m, n, i and j?
Don’t forget to skip over.
11/7/2015
29
FEC With Cardinality Constraints*
• Fill in the correct cardinalities
Candidate
CandName
Party
Address
Principal
0..1
Associated
0..*
Committee
CommName
Address
Donated to
Donor
Occupation
A domain expert, who knows the semantics of the
application, is needed to attach correct cardinalities.
11/7/2015
30
More Cardinality Constraints
• A relationship like Associated is one-to-many if one row
is Associated with many rows but not vice versa.
11/7/2015
31
More Cardinality Concepts
• Many to Many Relationships
– A left entity is related to many right entities and vice versa.
– Which of the FEC relationships is many to many?
– How can you tell that a relationship is many to many
• The max cardinality on both sides is greater than one
• One to Many Relationships
– A left entity is related to many right entities but not vice versa.
– Which of the FEC relationships is one to many?
• Max cardinality on at most one side is greater than one
• One to One Relationships
– A left entity is related to exactly one right entity, and vice versa.
– Which of the relationships is one to one?
• Cardinality on both sides is 1..1
• Relatively rare, means two entity sets correspond exactly.
11/7/2015
32
FEC with new Cardinalities
Candidate
CandName
Party
Address
One to Many
0..*
Principal
0..1
0..1
Associated
0..*
One to Many
Committee
CommName
Address
1..*
Donated to
Many to Many
0..*
Donor
Occupation
11/7/2015
33
Relationship sets can have
attributes
Employee
Ssn
Name
0..*
home
0..1
Department
Code
Name
lot
start-date
descriptive attribute
of the relationship set
11/7/2015
34
Try all three locations for the attributes:
What does each one mean?
Employee
Ssn
Name
0..*
home
0..1
Department
code
name
Lot
start-date
?
11/7/2015
start-date
?
start-date
?
35
What is a possible attribute for the
DonatedTo Relationship Set?*
Candidate
CandName
Party
Address
0..1
Principal
1..1
0..1
Associated
0..*
Committee
CommName
Address
1..*
Donated to
0..*
Donor
Occupation
11/7/2015
36
What’s wrong with this picture?
Doctor
Name
Address
Specialty
Primary Care
Appointment
Patient
Ssn
Address
Insurance ID
Time
11/7/2015
37
Binary vs. Ternary Relship Sets*
• A classic example of a ternary relationship set is Parent,
between the entity sets father, mother and child.
– The alternative is to store the data in binary relationship sets
Father and Mother.
Father
Mother
Parent
Father
Mother
• If Elayne is a Mother,
Len is a Father and
Dan, Ari and Joe are
Children, what are
examples of each
relationship (not set)?
Child
11/7/2015
38
Binary vs. Ternary, ctd*
• Ternary relationship sets are succint; Parent is one
relationship set; FatherOf and MotherOf are two.
• But binary relationship sets always (see notes) hold as
much or more information as a ternary (or n-ary)
relationship sets.
– For example, if a child has no mother you can store information
about the child's father in a binary relationship but not in a ternary
relationship (without using NULLs).
• However, one ternary relationship (not set) will store
more information than one binary relationship
• (Elayne,Len,Dan) vs. (Elayne,Dan)
• In this class, if you have a choice, use a ternary relation!
– (No cardinality constraints in ternary relations)
11/7/2015
39
Duality:
entity
attribute
Project
Number
Name
Due Date
Assignment
Manager
Employee
Ssn
Name
Office
Should Office be an attribute of Employee? or a
separate entity set? Most attributes can be “promoted”
to an entity set and some entities can be “demoted” to
an attribute value.
This explains why there are so many different ways to
design a schema.
11/7/2015
40
Entity vs. Attribute
Project
Number
Name
Due Date
Assignment
Manager
Employee
Ssn
Name
Assigned
Office
Office
Number
What are some reasons to model Office as an entity set?
• an employee can have more than one office
• there are other attributes of Office
• Office needs to participate in other relationship sets
such as a relationship set connecting to furniture or
telephones or network drops (located in the office)
11/7/2015
41
Entity vs. Attribute
Project
Number
Name
Due Date
Assignment
Manager
Employee
Ssn
Name
Assigned
Office
Office
Number
sq-ft
orientation
Location
NetworkDrop
IP-address
location
11/7/2015
42
LO1.1 Practice: Pharmacy
• Each patient has a unique ssn, age and primary doctor.
• Each drug has a unique generic name and a unique chemical
formula.
• Each drug is sold by a at least two pharmacies. Each
pharmacy sells the drug at some price. Each pharmacy has
a unique address, a phone number, and sells at least 100
drugs.
• Each prescription is written by a doctor for a patient and
dispenses a quantity of a drug.
• Gather into groups of 2-3. Draw the ER diagram. Include
cardinality constraints and underline the primary keys.
11/7/2015
43
Reminder: Guidelines
• If something has an attribute, it is an entity set or a
relationship set
• Use language from the RA in the ER diagram
• Relationship set names should be part of a sentence
uincluding the names of the entity sets.
• Some parts of the RA may not belong in the ER
diagram.
• Keep the ER diagram as simple as possible.
– No redundancy
• Check that every item in the RA is in the ER diagram if
it belongs there, and nothing else.
11/7/2015
44
Your ER Diagram: Pharmacy
11/7/2015
45
Table of Contents
• Introduction
– Translating Data into Information
– Three Grand Challenges
• Data is widespread
• Data is vast
• Structured vs Unstructured Data
• ER Diagrams
– Requirements Analysis
• Use cases
–
–
–
–
–
–
Entities, Attributes, Relationships
Instance, Legal Instance
Keys, Primary and Candidate
Attributes of Relationship Sets
Ternary Relationship Sets
Enties vs Attributes
Done
11/7/2015
• Relational Data Model
–
–
–
–
–
Relation, row, attributes, instance
Cardinality, Degree, Domain
Keys, Primary and Candidate
Schema, Legal Instance
Finding Candidate & Foreign Keys
• ER Diag.  Schema of Tables
• Database, DBMS
–
–
–
–
–
Definitions
Languages
Postgresql
FEC data
SQL
•
•
•
•
•
History
NULL values
DDL, DML
SELECT/FROM/WHERE
DISTINCT,ORDER BY,LIKE
46
Data Models
• The next step in our quest to transform data into information is
to transform an ER diagram into a more highly structured
model than the ER model.
• A Data Model is a language used to describe the structure of
data.
• The relational model is the intergalactic standard for storing
structured data, although some legacy data is still stored in
network and hierarchical models, and some niche data is
stored in object, XML and other models.
– If we were building, say, an XML database, then we would
transform our ER diagram into an XML schema
• The relational model is a theory, mathematically based, so we
can use it to prove results, though some of its assumptions,
like requiring keys, may not be realistic.
11/7/2015
47
Relational Model: Notation*
• A relation is a set of rows, each having the same
attributes.
• Look at the Sample Relations handout, also below.
• Look at the instance (current contents) of the Candidate
relation.
• How many rows are in this instance of the Candidate
relation?
– Called the cardinality of the relation.
• How many attributes are in this instance of the
Candidate relation?
– Called the degree of the relation
Sample Relations Document in  http://www.cs.pdx.edu/~len/386/fec/Mats.doc
11/7/2015
48
Domains*
• What are the names of the attributes in DonateTo?
• Each attribute has a domain – its set of possible values.
– Possible domains: character strings, integers, dates
– Decide domain in terms of its planned use
• Arithmetic? Integer. Otherwise character
– Domain is like a type in computer science.
– A relation is mathematically defined as subset of the cross
product of the domains.
• What is the domain of each attribute in DonateTo?
11/7/2015
49
Keys
• Key, primary key, and candidate key have the same
definitions* as in ER diagrams, but the relational model
requires that every relation have at least one key
attribute.
– The primary key of each relation is used to identify each data
item/row.
– The displayed candid, commid and fecid values were chosen by
the FEC to identify each candidate, etc.
*Key: A minimal set of attributes that uniquely identifies each row
Primary key: One key chosen as the primary way to identify each row
Candidate key: any key other than the primary key
11/7/2015
50
Finding Candidate Keys
• A schema of a relation consists of the names of the
relation, its attributes, their domains (often omitted), and
its primary keys (underlined).
• Here are some schemas. Circle the candidate keys in
legal instances of the schemas.
Teacher (SSN, Name, Office, E-mail)
Course (Department, Number, Name, Description)
Class-Offering (Quarter, CRN, Section, Teacher, TimeDays)
Student (StudentID, Name, Major, Advisor)
Completed (StudentID, CRN, Quarter, Section, Grade)
• A domain expert is required to determine candidate keys
11/7/2015
51
More on Keys and Relations*
• This relation has a lot of attributes, named A-T. Can you
find a set of them that constitutes a key for this instance?
• Why or why not?
• What can you conclude about keys and relations?
• Is this really a relation?
• A relation must be a set.
A
1
1
0
1
z
B
2
2
2
w
2
11/7/2015
C
4
4
b
4
4
D
2
2
3
2
2
E
3
3
3
3
g
F
5
5
6
6
6
G
9
9
9
9
9
H
8
8
6
8
8
I
3
3
f
3
3
K
7
7
7
3
7
L
4
4
8
v
1
M
3
3
5
5
5
N
1
1
1
1
1
O
8
8
8
8
8
P
4
4
4
4
4
Q
9
9
9
9
9
R
1
1
1
1
1
S
3
3
3
3
3
T
44
44
xx
zz
44
52
Foreign Keys*
• A key of a relation uniquely identifies each data item.
• A foreign key references data items in other relations.
– It can reference a primary or a candidate key in the other
relation.
• For example, in the Sample Relations handout, what are
the foreign keys?
11/7/2015
53
LO1.2: Finding Foreign Keys
• Circle the foreign keys in these relations. Use an arrow
to point to the key that each foreign key references.
Teacher (SSN, Name, Office, E-mail)
Course (Department, Number, Name, Description)
Class-Offering (Quarter, CRN, Section, Teacher, TimeDays)
Student (StudentID, Name, Major, Advisor)
Completed (StudentID, CRN, Quarter, Section, Grade)
11/7/2015
54
A Broken Promise
• A table is something stored in a DBMS. It is the same as
a relation except it need not have a key and it can have
an order.
• Peek at the Databasics Anonymous 7-step method on the
next page.
• We defined the relational model so that we could transform an
ER diagram into a schema of relations.
• There is just one catch. Codd's relational model requires that
each relation have a key, and some entity sets, and some
tables in DBMSs, don't. So in practice we skip a step and
translate ER diagrams into a schema of tables.
11/7/2015
55
Transform an ER Diagram to a
Schema of Tables
•
Recall the Databasics Anonymous 7-step method:
Done 1. Organizing structured data into an ER Diagram, Ch. 2
Next 2. Transforming an ER diagram into a Schema of Tables, Ch. 3
3. Eliminating anomalies from those tables (normalization), Ch. 21
4. Structuring those tables efficiently (physical design), Ch. 22
5. Managing those tables using the intergalactic standard
language (SQL), Ch. 5
•
The DBMS manages the tables internally using Relational Algebra,
Ch. 4
6. Protecting those tables during concurrent use (ACID
properties), Ch. 16
7. Accessing those tables through a web-based interface (some
scripting language)
•
We’ve completed step 1, and learned about the relational
model so we could perform step 2. now we begin step 2
11/7/2015
56
Step 1: Translate each entity set into a table.
Introduce a primary key if you wish*
Candidate
CandName
Party
Address
Principal
Associated
Committee
CommName
Address
Donated to
Donor
Occupation
11/7/2015
57
2. Translate each many to many
relationship set into a table.*
Committee
CommID
Name
Address
•
1..*
DonatedTo
0..*
Donor
ID
Occupation
Amount
We seek a table that represents the relationship set
DonatedTo. Try to think of its name, attributes and
primary key.
Committee( CommID, Name, Address)
Donor( ID, Name, Occupation, Address)
11/7/2015
58
3. What about One to Many Rel. Sets?*
Candidate
CandID
Name
Party
Address
0..1
Associated
0..*
Committee
CommID
Name
Address
Candidate( CandID, Name, Party, Address)
Committee( CommID, Name, Address)
•
How can we represent the Associated Relationship set?
•
Replace the previous Committee table with
•
Or add the relation
•
Which alternative do you prefer? We use the first.
11/7/2015
59
Will the Real FEC database please
stand up?
• We have analyzed the raw FEC data using the ER
diagram technique and concluded, correctly, that we
should model it with Candidate, Committee, Donor and
DonatedTo tables.
• Now let's consider the real FEC database, the one
stored on the web by the FEC. It's quite different. Donor
data is not stored separately. The FEC stores donor
data (name, occupation, partial address) with each
donation, because that is how each campaign committee
collects and reports its data.
• Later in the class, when we study normalization, we will
see that this redundant storage of donor information can
lead to anomalies and problems, but it happens in the
real world.
11/7/2015
60
Summary: ER to Schema of Tables
1. Translate each entity set into a table,
introducing a primary key if you wish.
2. Translate each many to many relationship set
into a table with the primary keys of the
participating entity sets as keys and the
attributes as attributes.
3. Translate each one to many relationship set into
a foreign key in the “many” table and add its
attributes as attributes.
11/7/2015
61
LO1.3: Translate ER to Tables*
Employee
SSN
Name
0..* Assigned
1..1
PercentTime
11/7/2015
0..*
Manager
0..*
Project
Number
Name
Budget
Since
62
Review, Preview
• Remember that we are working with structured data. We
will work with unstructured data, like documents and web
pages, at the end of the course.
• We’ve taken raw data, expressed it as an ER diagram, and
transformed it into a schema of tables.
• Our next step is to implement these tables in a software
system, called a relational database management system
(RDBMS, or DBMS).
– Example DBMSs: Postgresql(we will use), MySQL (both open
source), Access, Oracle, DB2, SQLServer
• Next we will survey DBMSs and study in depth their query
languages: relational algebra and SQL.
11/7/2015
63
Databasics Anonymous 7-step method
Done
1. Organizing structured data into an ER Diagram, Ch. 2
2. Transforming an ER diagram into a Schema of Tables, Ch. 3
3. Eliminating anomalies from those tables (normalization), Ch. 21
4. Structuring those tables efficiently (physical design), Ch. 22
Next
5. Managing those tables using the intergalactic standard language
(SQL), Ch. 5
–
The DBMS manages the tables internally using Relational Algebra, Ch. 4
6. Protecting those tables during concurrent use (ACID properties), Ch.
16
7. Accessing those tables through a web-based interface (some
scripting language)
11/7/2015
64
What is a Database? DBMS?
• Database (DB) - a collection of structured persistent
data, typically >> physical memory
– Persistent: exists after its creator terminates
• Database management system (DBMS) - a software
system that supports the definition, population,
querying*, updating and management of a database.
DBMS
DB
*The term “query” typically refers to read, as opposed to update, statements
11/7/2015
65
What is a DBMS, really?
• We said a DBMS supports the definition, etc., of a
database. So why isn’t Excel a DBMS?
• True support means:
– A sound theoretical model, the relational model, supporting
the data’s structure and access.
– An efficient (fast) and relatively simple (declarative)
language, SQL, for retrieval of disk-based data.
– Disk-based data structures to speed up data retrieval.
– Logical and physical independence of programs from data
– Safe access to data by multiple users
– Recovery from crashes
– An extensive security system
– Support for parallel and distributed access to data
– Support for access to nonstandard types of data
• blobs, geometric data, video, etc.
11/7/2015
66
Cast of Characters
• Database vendor developers
• Database application developers
• Database administrators (DBAs)
–
–
–
–
–
Designs & loads the database
Security, Authorization
Crash Recovery
Tuning
Help….
11/7/2015
67
(R)DBMS Languages
• We will learn two DBMS languages
• The first, relational algebra, was originally defined by
Codd as a theoretical language to accompany the
relational model, to query relations. It is also the
language used internally by a DBMS to query databases.
• The second, SQL, is used by programmers to define,
populate, query, update and manage databases.
11/7/2015
68
(R)DBMS Architecture*
Web Form
Applic. Front end
SQL interface
SQL
Security
Parser
Relational Algebra(RA)
Catalog
Optimizer
Executable Plan (RA+Algorithms)
Concurrency
Crash
Recovery
Plan Executor
Files, Indexes &
Access Methods
Database, Indexes
11/7/2015
69
Relations vs. Tables
• Recall that a relation is described by a schema such
as Candidate(Candid,Name,…).
• Relation is a theoretical concept.
• Every relation must have a key.
• Table is a concept defined in SQL. It is an object
stored by a DBMS. It is sometimes called a SQL
relation.
• A relational database consists of tables
• A table has exactly the same properties as a relation
except that it need not have a key. A table is a
multiset, not a set.
11/7/2015
70
Relations vs. Tables, ctd.*
• Recipients, below, is the table (note it does not have a
key) of committees that received donations from zip
code 97223 in March.
• SQL operates on tables only.
• Relational algebra operates on relations, in the relational
model, and on tables, in a DBMS.
• We will study relational algebra operating on tables.
Recipients commid
C00383554
C00383554
C00383554
C00383554
C00430470
C00431445
C00431445
C00431445
C00437277
C00437277
C00437277
11/7/2015
71
The FEC Database
• We're going to learn SQL and relational algebra using the (real)
FEC database
– See http://www.fec.gov/finance/disclosure/ftpdet.shtml#a2007_2008
• We've stored the four significant tables in a Postgresql DBMS
Table
Name Rows
cand
3833
comm
10287
indiv 928956
pas
172675
Meg Table Meaning
1
3
157
13
Candidates
Committees
Donations from Individuals to Committees
Donations from Committees to Candidates
• We gave them short names so you would not have to type so
much, and lower case names because Postgresql (PG) does not
handle uppercase names well.
• The PG database also includes "or" versions (for homework) and
"cl" versions (for class) of each table.
11/7/2015
72
Postgresql and the FEC data*
•
You can access the FEC database with the user name cs386
and the password introdb. You have a choice of two
interfaces to the Postgresql FEC DBMS.
1. There is a gui interface at
https://dbclass.cs.pdx.edu/phppgadmin
2. and a command line interface from any psu unix machine:
$ addpkg
# choose pgsql-8. Do this just once
$ psql -U cs386 -h dbclass.cs.pdx.edu cs386
•
Details are in the notes view.
•
Postgres documentation is at
http://www.postgresql.org/docs/8.3/interactive/
11/7/2015
73
The CL tables*
• In class we will use excerpts from the full database, with
the suffix cl to signify "in class".
• See the handout and below.
CL Tables Document in  http://www.cs.pdx.edu/~len/386/fec/Mats.doc
11/7/2015
74
History of SQL
• SQL (pronounced EssQueEll or Sequel) was part of System R,
now DB2.
– Everyone agrees that Ingres’ language, Quel, was technically
superior, but IBM’s marketing won the day.
• SQL is maintained by ANSI & ISO. The major standards are
– 1992 (SQL2)
– 1999 (SQL3), added regular expression matching and triggers,
recursive queries, procedural and control-of-flow statements, nonscalar types, and some object-oriented features
– 2003,6 added XML related features, window functions,
standardized sequences & auto-generated values.
– See the syllabus for copies of the SQL2,3 standards.
• DBMSs typically support almost all of the SQL2 & 3 standards
but differ in a few crucial areas, e.g., join syntaxes and stored
procedures.
• Each Postgresql SQL documentation page tells, at the bottom, if
any part of the statement is nonstandard.
11/7/2015
75
NULL Values & Primary Keys in SQL
• SQL supports a special value called NULL, meaning
that the value is
– Unknown, e.g., age of someone unwilling to specify, or
– Inapplicable, e.g., spouse of a single employee
• Declaring a field to be a primary key in SQL means that
it is unique (key in the usual sense) and contains no
NULL values.
– Note that this differs from the meaning of key in the ER and
relational models.
– It makes sense because a primary key is the principal means
to locate an item and a NULL value would not make sense.
– Foreign keys can have NULL values.
• Thus when creating a table, declare an attribute to be
NOT NULL if it will be a primary key.
11/7/2015
76
DDL for FEC Database
•
The DDL statements for the FEC database exist in three
files:
0. u2uc – not DDL. A shell script to format the FEC files for
copying into PostgreSQL
1. Tables.sql - CREATE TABLE statements
2. cs - A shell script that calls psql statements to bulk copy
the data into any PostgreSQL database.
3. DDL.sql - Most of the DDL statements: ALTER TABLE,
etc.
•
Don't try to understand them all at this point.
FEC DDL Documents  http://www.cs.pdx.edu/~len/386/fec/Mats.sql
11/7/2015
77
Why Separate CREATE TABLE?
•
The no-brainer way to create a database is
1. Issue all DDL statements: CREATE TABLE, create all
structures such as primary keys, foreign keys, indexes.
2. Add all data, with INSERT or bulk copy statements.
•
•
•
For medium to large databases, this can take forever.
As each row is added to the database, the DBMS will
search the disk (horrors) to find its place in each
structure.
So in real life, code like that on page 68 of our text is
rarely used. Instead, we
1. Issue all CREATE TABLE statements
2. Bulk copy all data into the database
3. Issue all DDL statements, using ALTER TABLE…
•
This allows the DBMS, in step 3, to process each
structure efficiently.
11/7/2015
78
SQL’s Data Manipulation Language
• SQL’s DML consists of these statements:
–
–
–
–
INSERT INTO table VALUES …
SELECT … FROM..[WHERE]
UPDATE table SET column = …
DELETE FROM table WHERE …
• We will study the SELECT clause in detail. The
syntax of the UPDATE and DELETE clauses are
similar, and all of them can be found in the
documentation.
• The acronym CRUD (Create(INSERT),
Read(SELECT), Update, Delete) is often used to
refer to the four basic operations on persistent data.
When you have implemented CRUD, your application
is complete.
11/7/2015
79
Querying a Database With SQL
• SQL’s SELECT statement retrieves data
• What are the occupations of donors in indivcl who gave
> $1000?
SELECT occup
FROM indivcl
3 retain this attribute
1 consider rows in this table
WHERE amount > 1000; 2 choose rows with this condition
• The answer
occup
PACIFIC CREST/PRINCIPLE
PACIFIC CREST/PRINCIPLE
SELF-EMPLOYED
Notice that SQL's clauses are "out of order". One of SQL's primary problems.
11/7/2015
80
Practice*
• Describe in SQL : What are the occupations of donors
who made donations in March (the 3rd month)?
• What does this query retrieve?
SELECT occup
FROM indivcl
WHERE month = 2 AND month = 3;
• Is there anything wrong with this query?
11/7/2015
81
Eliminating Duplicates in SQL
• The answer to the previous query contained two copies
of the occupation "PACIFIC CREST/PRINCIPLE".
• How can we eliminate such duplicates?
SELECT DISTINCT occup
FROM indivcl
WHERE month = 3;
11/7/2015
82
ORDER BY
• Display donations in order of month, most recent first,
and then in order of donation, smallest first.
SELECT * FROM indivcl
ORDER BY month desc, amount ASC;
• ASC is the default.
• SELECT * means retrieve all attributes
11/7/2015
83
LIKE Predicate
• Wildcards in SQL are different than regular expressions
in UNIX.
• %: 0 or more characters, _: One character
• List the donations of donors who are retired or selfemployed.
SELECT * FROM indivcl
WHERE LOWER(occup) LIKE '%retire%' OR
LOWER(occup) LIKE '%self%';
• NOT LIKE is also possible
• SQL3 includes many string manipulation functions such
as LOWER( )
– http://web.cecs.pdx.edu/~len/sql1999.pdf, pg. 14
11/7/2015
84
Structure of SQL
• SQL includes two major sections, DDL and DML
• DDL, the Data Definition Language, defines the
schema of the database.
• CREATE statements create the schema.
• ALTER statements can change the schema.
• DROP statements can remove parts of the schema.
• See the Postgresql documentation for detailed syntax
of these statements.
11/7/2015
85
Exercises*
1. Draw an ER diagram to keep track of votes taken in the US
House of Representatives during the current 2-year
congressional session. [Note that the house, with 435
members, is separate from the Senate.] Show all
cardinalities and primary keys.
•
•
•
•
•
Keep track of each congressperson ( unique name, district number,
party).
Keep track of the parties (Republican, Democrat or Independent)
and for each of the two major parties, keep track of the name of the
party’s (majority or minority) leader.
For each bill, keep track of its ID, the date it was first proposed, the
date each sponsor signed on to sponsor it and its status.
Record how each congressperson voted on each bill (yes, no,
abstain or absent), assuming there was only one vote (unrealistic).
Record cardinalities and primary keys.
11/7/2015
86
Exercises, ctd.*
2. Identify, with circles and arrows, all the foreign keys in
this schema.
movies( title, year, genre, studio, producer)
movie_star( stage_name, ssn, address, birthday )
stars_in( name, movie, year)
studio_info( name, address, president )
producer_data(name, address)
•
•
Assume no two movies in one year have the same title.
Assume no two movie stars have the same stage name.
11/7/2015
87
Exercises, ctd.*
3.
Describe a schema of tables for this ER diagram from a school where
courses are team-taught:
Teacher
Ssn
Name
Degree
1..*
Belongs To
1..*
0..*
Team
Name
Primary Area
PerCt
Favorite
1..1
Teaches
Date
Since
11/7/2015
1..*
Course
1..1 Name
Description
88