Introduction to Databases, CS 3/586 Fall 2009 Instructor: Len Shapiro, len at pdx.edu Syllabus: www.cs.pdx.edu/~len/386 This work, and all other files in this series,
Download ReportTranscript Introduction to Databases, CS 3/586 Fall 2009 Instructor: Len Shapiro, len at pdx.edu Syllabus: www.cs.pdx.edu/~len/386 This work, and all other files in this series,
Introduction to Databases, CS 3/586 Fall 2009 Instructor: Len Shapiro, len at pdx.edu Syllabus: www.cs.pdx.edu/~len/386 This work, and all other files in this series, are licensed under the Creative Commons Attribution License. To view a copy of this license, visit http://creativecommons.org/licenses or send a letter to Creative Commons, 171 2nd Street, Suite 300, San Francisco, California, 94105, USA. These slides have benefitted from the content of slides developed by Raghu Ramakrishnan and Johannes Gehrke, Lois Delcambre and David Maier, all of whom I thank. These slides are best viewed with “slide show”. To print the slides, consider using "File/Print/Properties/Pages per sheet" instead of Power Point's "handouts" mode. Slides will be ready by 6PM each Monday evening. 11/7/2015 1 Table of Contents • Introduction – Translating Data into Information – Three Grand Challenges • Data is widespread • Data is vast • Structured vs Unstructured Data • ER Diagrams – Requirements Analysis • Use Cases – – – – – – Entities, Attributes, Relationships Instance, legal instance Keys, Primary and Candidate Attributes of Relationship Sets Ternary Relationship Sets Enties vs Attributes 11/7/2015 • Relational Data Model – – – – – Relation, row, attributes, instance Cardinality, Degree, Domain Keys, Primary and Candidate Schema, Legal Instance Finding Candidate & Foreign Keys • ER Diag. Schema of Tables • Database, DBMS – – – – – Definitions Languages Postgresql FEC data SQL • • • • • History NULL values DDL, DML SELECT/FROM/WHERE DISTINCT,ORDER BY,LIKE 2 Prerequisites • The prerequisites for this course are CS 161 and 250. • This week's class will expect you to know the definitions of these terms from CS250: sets, bags, tuples, relations. • Please review these concepts from the text that you used for CS250 or its equivalent. • The text that we use at PSU is [James Hein, Discrete Structures, Logic and Compatibility, 3rd Edition, Jones and Bartlett, 2009]. 11/7/2015 3 Learning Objectives* LO1.1 Write an ER diagram from an RA using UML, including primary keys, cardinality constraints, attributes of relationship sets, and ternary relationship sets LO1.2 Given a schema, find all foreign keys for legal instances of the schema. LO1.3 Translate an ER diagram into a schema of tables, preserving all information. *An asterisk in the title means that there is info on the notes page (View/Notes page). This info is either what I say in class, or answers to questions posed on the slides. I recommend that you NOT print the notes page in the copy of the slides that you bring to class, but look at the notes page if you miss class or if your notes are incomplete. 11/7/2015 4 Information • Digital Information is precious • Modern business, culture and society could not exist in its present form without digital information. • Think about how often you use digital information in your life. • If most modern businesses lost their current and backup information they would be candidates for bankruptcy. • But nature gives us data, not information. • Data: disorganized, slow to access, understand and visualize 11/7/2015 5 Data vs. Information • Data has its place. It’s not useless. Data Information Atlas Encyclopedia Brick and Mortar Store Printout Yahoo Maps, Mapquest Wikipedia, Google Search Online Store, e.g., Amazon Data Access Form • In other CS courses, you learn how to manipulate (inmemory) data, using programs. • In CS386/586 you will learn to transform data into information and manage that information. 11/7/2015 6 Three Challenges of Data* 1. Data is widespread – accessible by the Internet – Covered in CS4/594, Internetworking Protocols 2. Data is vast - Much too large to fit in memory • • • • • Library of congress: 20 terabytes (see notes view for units) Photos uploaded to Facebook each month: 20 terabytes Amazon.com: 42 terabytes Choicepoint: 250 terabytes Data processed by Google’s servers each hour: 1 Petabyte – In this class you’ll learn how to • • • 11/7/2015 Transform data into information Manage disk-based information Manage information in parallel 7 Database Challenges 3: Data Structure* • Some data, e.g., business data, has clearly defined attributes. In this class you’ll learn how to transform such structured data into information by the Databasics Anonymous 7-step method**: 1. 2. 3. 4. 5. Organizing structured data into an ER Diagram, Ch. 2 Transforming an ER diagram into a Schema of Tables, Ch. 3 Eliminating anomalies from those tables (normalization), Ch. 21 Structuring those tables efficiently (physical design), Ch. 22 Managing those tables using the intergalactic standard language (SQL), Ch. 5 • The DBMS manages the tables internally using Relational Algebra, Ch. 4 6. Protecting those tables during concurrent use (ACID properties), Ch. 16 7. Accessing those tables through a web-based interface (some scripting language) **I chose this name out of respect for Alchoholics Anonymous, which has saved innumerable lives 11/7/2015 8 Data Challenges 3: Structure (ctd) • Some data’s attributes are not clearly defined, e.g., documents, web pages – The attributes are words in the data or tags – In this class you’ll learn how to transform unstructured data into information by: • Building data structures to make retrieval efficient (inverted indexes) • Ranking retrieval results (Google’s page rank algorithm) • Measuring the effectiveness of querying unstructured data (recall) 11/7/2015 9 A Brief Database History* 1960s: Network and hierarchical models of structured data 1970s: E.F. Codd's relational model, implemented in Ingres and System R 1976: P. Chen's ER model 1980s: Relational systems/SQL dominate, PC DBMSs become popular 1990s: Client-server architecture, object-oriented model, Excel, WWW, Google 11/7/2015 10 Overview • This ends our introduction • First we will study structured data, for which the intergalactic standard model is the relational model. • Recall the Databasics Anonymous 7-step method**: 1. 2. 3. 4. 5. Organizing structured data into an ER Diagram, Ch. 2 Transforming an ER diagram into a Schema of Tables, Ch. 3 Eliminating anomalies from those tables (normalization), Ch. 21 Structuring those tables efficiently (physical design), Ch. 22 Managing those tables using the intergalactic standard language (SQL), Ch. 5 1. 6. 7. • The DBMS manages the tables internally using Relational Algebra, Ch. 4 Protecting those tables during concurrent use (ACID properties), Ch. 16 Accessing those tables through a web-based interface (some scripting language) We will perform steps 1, 2 and part of 5 this week. 11/7/2015 11 Election Data • For examples in this course we’ll be using data, collected by the Federal Elections Commission, about contributions to campaign committees, from the 2008 election*. • General info about the FEC is at http://www.fec.gov/ans/answers_general.shtml . • Forms-based interfaces to the data are at http://www.fec.gov/finance/disclosure/disclosure_data_search.shtml • Raw data in tabular form is at http://www.fec.gov/finance/disclosure/ftpdet.shtml • Our, and all database texts, use simple data examples. We hope that by using real data you’ll not only gain more valuable experience but be able to ask interesting questions of the data. • Perhaps you’ll make some newsworthy discoveries. *This idea was suggested by Vassilis Papadimos, whom we thank. 11/7/2015 12 Raw Data* • When you first encounter data, it may be in a raw, disorganized form. In front of you, and below, is some raw FEC data. Note – Some data (at the top) is descriptive, some (at the bottom) is in tables. – Some data items are self contained • But note “at the same address as his principal committee” – Some notation needs human interpretation • ‘names all start with “McCain Victory” and end with one of…’ Raw Data Document in http://www.cs.pdx.edu/~len/386/fec/Mats.doc 11/7/2015 13 Real World Data • Real world data often starts out like the raw data in front of you, except that there is often much more data and many more attributes. • The first step in transforming this raw data into information is to draw a diagram of it, with the hope that your customer can understand the diagram and thus the structure of the data. • One way to make the diagram understandable is to use few concepts in the diagram: primarily entity and relationship. 11/7/2015 14 Entity and Relationship • An entity is a real-world object distinguishable from other objects. – An entity is described using a set of attributes. • A relationship is an association among 2 or more entities. • What are some possible entities, relationships and attributes in the raw FEC data? 11/7/2015 15 ER Diagram of FEC Data* Candidate CandName Party Address Principal Associated Committee CommName Address Entity Set Attribute Donated to Relationship Set Donor Occupation 11/7/2015 16 Entity, Relationship Sets* • There is a subtle distinction here: – Gordon Smith, a candidate, is an entity. – “Candidate” is the name of the set of all candidates and is called an entity set. – Similarly, the pair (Jeff Merkley, Jeff Merkley for Oregon) is a relationship, but “Principal” is a relationship set. • What are some examples of entities, entity sets, relationships ,and relationship sets? • I will sometimes be sloppy and refer to Candidate, Committee, etc as entities, but that is not precise. • There is no such distinction for attributes 11/7/2015 17 Requirements Analysis • The standard wisdom (from our text and CS300) is to transform the raw data, as in the handout, into a Requirements Analysis, or RA, as on the next page, then to transform the RA into an ER diagram. • From now on we’ll give you an RA and expect you to transform it into an ER diagram. 11/7/2015 18 Requirements Analysis • Store information about employees, departments and projects – Each employee has a ssn and name – Each department has a code and a name – Each project has a number, name, due date and budget – Employees are assigned to projects – Departments sponsor projects – Each department has a manager and each employee has a home department 11/7/2015 19 Guidelines • If something has an attribute, it is an entity set • Use language from the RA in the ER diagram • Relationship set names should be part of a sentence including the names of the entity sets. • Some parts of the RA may not belong in the ER diagram. • Keep the ER diagram as simple as possible. – No redundancy • Check that every item in the RA is in the ER diagram if it belongs there, and nothing else. 11/7/2015 20 Your Emp-Dept ER Diagram* 11/7/2015 21 Instance, Legal Instance* • An instance of an entity set is the current contents of the entity set. • A legal instance is a an instance that could occur in the real world. – What is an example of an illegal instance? • Normally you need a domain expert to determine what is a legal instance. • For example, can two departments have the same name? The same code? The same name and the same code both? It depends on company policies – only a domain expert will know. 11/7/2015 22 Keys* • A key is a minimal set of attributes that uniquely defines an entity in an entity set, for all legal instances of the entity set. – Is name a key for Employee? (Number,Name) a key for Project? – You don't know unless you ask a domain expert. • If there is more than one key in an entity set, one is chosen as the primary key; the others are called candidate keys. – The primary key is usually the one by which the data items are most often accessed • In ER diagrams, the primary key is underlined. • PRACTICE: Underline primary keys in the Emp-Dept ER diagram. – Len is the domain expert. • Note that some ER diagrams do not have keys, e.g., the FEC data. It is common, but not required, during the transformation from data to information, to add keys at some point. 11/7/2015 23 Requirements Analysis: Professors • Professors have a SSN, a name and an age and their SSNs uniquely identify them. • There is a bulletin. In the bulletin, each course is listed with the professor who supervises the course and the professors who teach the course. • Courses are uniquely identified by their Quarter and CRN, and they have a name. • Sample data: CS386 taught by Len and Lois, supervised by Dave • PRACTICE: Draw an ER Diagram, including primary keys – INCLUDE ONLY INFO IN THE RA! 11/7/2015 24 Your Prof ER Diagram* 11/7/2015 25 Sidebar: Use cases • In Homework 7 you will need to define 3-5 use cases for your application to illustrate how your users will interact with your application. • I'll give you an example so you can begin thinking about that assignment. • A use case describes how an actor interacts with the system you have built. • Here's an example use case for the Prof-Course ER diagram we just produced, assuming it has been implemented as a system. 11/7/2015 26 Example Use Case: Find Course Supervisor Actor: Student who wants to find course supervisor 1. Student logs into Banner a. If login is invalid, issue error message and return to step 1 2. Student chooses "find course supervisor" from menu. 3. Student inserts quarter and CRN from dropdown list. 4. System returns course supervisor information. 5. Student chooses "exit" from menu. 11/7/2015 27 Notes about Use Cases • Each use case must have a name and an actor. The actor may be another system, for example if another company is purchasing an item. • Each step of the use case is doable by the actor or the system you have built. • Error cases can be handled inline, as in the example, or at the end of the use case. • In order to find uses cases, think of who users are, what they will do, and/or what screens will look like. • Use the given format: start with the actor and number the steps. 11/7/2015 28 Cardinality Constraints on Relship sets: How many entities can participate? Candidate i..j Associated m..n CandName Party Address Committee CommName Address • m..n means each candidate is the associated candidate for at least m and at most n different committees. i..j means that each committee has at least i and at most j associated candidates. – What are reasonable values for m, n, i and j? Don’t forget to skip over. 11/7/2015 29 FEC With Cardinality Constraints* • Fill in the correct cardinalities Candidate CandName Party Address Principal 0..1 Associated 0..* Committee CommName Address Donated to Donor Occupation A domain expert, who knows the semantics of the application, is needed to attach correct cardinalities. 11/7/2015 30 More Cardinality Constraints • A relationship like Associated is one-to-many if one row is Associated with many rows but not vice versa. 11/7/2015 31 More Cardinality Concepts • Many to Many Relationships – A left entity is related to many right entities and vice versa. – Which of the FEC relationships is many to many? – How can you tell that a relationship is many to many • The max cardinality on both sides is greater than one • One to Many Relationships – A left entity is related to many right entities but not vice versa. – Which of the FEC relationships is one to many? • Max cardinality on at most one side is greater than one • One to One Relationships – A left entity is related to exactly one right entity, and vice versa. – Which of the relationships is one to one? • Cardinality on both sides is 1..1 • Relatively rare, means two entity sets correspond exactly. 11/7/2015 32 FEC with new Cardinalities Candidate CandName Party Address One to Many 0..* Principal 0..1 0..1 Associated 0..* One to Many Committee CommName Address 1..* Donated to Many to Many 0..* Donor Occupation 11/7/2015 33 Relationship sets can have attributes Employee Ssn Name 0..* home 0..1 Department Code Name lot start-date descriptive attribute of the relationship set 11/7/2015 34 Try all three locations for the attributes: What does each one mean? Employee Ssn Name 0..* home 0..1 Department code name Lot start-date ? 11/7/2015 start-date ? start-date ? 35 What is a possible attribute for the DonatedTo Relationship Set?* Candidate CandName Party Address 0..1 Principal 1..1 0..1 Associated 0..* Committee CommName Address 1..* Donated to 0..* Donor Occupation 11/7/2015 36 What’s wrong with this picture? Doctor Name Address Specialty Primary Care Appointment Patient Ssn Address Insurance ID Time 11/7/2015 37 Binary vs. Ternary Relship Sets* • A classic example of a ternary relationship set is Parent, between the entity sets father, mother and child. – The alternative is to store the data in binary relationship sets Father and Mother. Father Mother Parent Father Mother • If Elayne is a Mother, Len is a Father and Dan, Ari and Joe are Children, what are examples of each relationship (not set)? Child 11/7/2015 38 Binary vs. Ternary, ctd* • Ternary relationship sets are succint; Parent is one relationship set; FatherOf and MotherOf are two. • But binary relationship sets always (see notes) hold as much or more information as a ternary (or n-ary) relationship sets. – For example, if a child has no mother you can store information about the child's father in a binary relationship but not in a ternary relationship (without using NULLs). • However, one ternary relationship (not set) will store more information than one binary relationship • (Elayne,Len,Dan) vs. (Elayne,Dan) • In this class, if you have a choice, use a ternary relation! – (No cardinality constraints in ternary relations) 11/7/2015 39 Duality: entity attribute Project Number Name Due Date Assignment Manager Employee Ssn Name Office Should Office be an attribute of Employee? or a separate entity set? Most attributes can be “promoted” to an entity set and some entities can be “demoted” to an attribute value. This explains why there are so many different ways to design a schema. 11/7/2015 40 Entity vs. Attribute Project Number Name Due Date Assignment Manager Employee Ssn Name Assigned Office Office Number What are some reasons to model Office as an entity set? • an employee can have more than one office • there are other attributes of Office • Office needs to participate in other relationship sets such as a relationship set connecting to furniture or telephones or network drops (located in the office) 11/7/2015 41 Entity vs. Attribute Project Number Name Due Date Assignment Manager Employee Ssn Name Assigned Office Office Number sq-ft orientation Location NetworkDrop IP-address location 11/7/2015 42 LO1.1 Practice: Pharmacy • Each patient has a unique ssn, age and primary doctor. • Each drug has a unique generic name and a unique chemical formula. • Each drug is sold by a at least two pharmacies. Each pharmacy sells the drug at some price. Each pharmacy has a unique address, a phone number, and sells at least 100 drugs. • Each prescription is written by a doctor for a patient and dispenses a quantity of a drug. • Gather into groups of 2-3. Draw the ER diagram. Include cardinality constraints and underline the primary keys. 11/7/2015 43 Reminder: Guidelines • If something has an attribute, it is an entity set or a relationship set • Use language from the RA in the ER diagram • Relationship set names should be part of a sentence uincluding the names of the entity sets. • Some parts of the RA may not belong in the ER diagram. • Keep the ER diagram as simple as possible. – No redundancy • Check that every item in the RA is in the ER diagram if it belongs there, and nothing else. 11/7/2015 44 Your ER Diagram: Pharmacy 11/7/2015 45 Table of Contents • Introduction – Translating Data into Information – Three Grand Challenges • Data is widespread • Data is vast • Structured vs Unstructured Data • ER Diagrams – Requirements Analysis • Use cases – – – – – – Entities, Attributes, Relationships Instance, Legal Instance Keys, Primary and Candidate Attributes of Relationship Sets Ternary Relationship Sets Enties vs Attributes Done 11/7/2015 • Relational Data Model – – – – – Relation, row, attributes, instance Cardinality, Degree, Domain Keys, Primary and Candidate Schema, Legal Instance Finding Candidate & Foreign Keys • ER Diag. Schema of Tables • Database, DBMS – – – – – Definitions Languages Postgresql FEC data SQL • • • • • History NULL values DDL, DML SELECT/FROM/WHERE DISTINCT,ORDER BY,LIKE 46 Data Models • The next step in our quest to transform data into information is to transform an ER diagram into a more highly structured model than the ER model. • A Data Model is a language used to describe the structure of data. • The relational model is the intergalactic standard for storing structured data, although some legacy data is still stored in network and hierarchical models, and some niche data is stored in object, XML and other models. – If we were building, say, an XML database, then we would transform our ER diagram into an XML schema • The relational model is a theory, mathematically based, so we can use it to prove results, though some of its assumptions, like requiring keys, may not be realistic. 11/7/2015 47 Relational Model: Notation* • A relation is a set of rows, each having the same attributes. • Look at the Sample Relations handout, also below. • Look at the instance (current contents) of the Candidate relation. • How many rows are in this instance of the Candidate relation? – Called the cardinality of the relation. • How many attributes are in this instance of the Candidate relation? – Called the degree of the relation Sample Relations Document in http://www.cs.pdx.edu/~len/386/fec/Mats.doc 11/7/2015 48 Domains* • What are the names of the attributes in DonateTo? • Each attribute has a domain – its set of possible values. – Possible domains: character strings, integers, dates – Decide domain in terms of its planned use • Arithmetic? Integer. Otherwise character – Domain is like a type in computer science. – A relation is mathematically defined as subset of the cross product of the domains. • What is the domain of each attribute in DonateTo? 11/7/2015 49 Keys • Key, primary key, and candidate key have the same definitions* as in ER diagrams, but the relational model requires that every relation have at least one key attribute. – The primary key of each relation is used to identify each data item/row. – The displayed candid, commid and fecid values were chosen by the FEC to identify each candidate, etc. *Key: A minimal set of attributes that uniquely identifies each row Primary key: One key chosen as the primary way to identify each row Candidate key: any key other than the primary key 11/7/2015 50 Finding Candidate Keys • A schema of a relation consists of the names of the relation, its attributes, their domains (often omitted), and its primary keys (underlined). • Here are some schemas. Circle the candidate keys in legal instances of the schemas. Teacher (SSN, Name, Office, E-mail) Course (Department, Number, Name, Description) Class-Offering (Quarter, CRN, Section, Teacher, TimeDays) Student (StudentID, Name, Major, Advisor) Completed (StudentID, CRN, Quarter, Section, Grade) • A domain expert is required to determine candidate keys 11/7/2015 51 More on Keys and Relations* • This relation has a lot of attributes, named A-T. Can you find a set of them that constitutes a key for this instance? • Why or why not? • What can you conclude about keys and relations? • Is this really a relation? • A relation must be a set. A 1 1 0 1 z B 2 2 2 w 2 11/7/2015 C 4 4 b 4 4 D 2 2 3 2 2 E 3 3 3 3 g F 5 5 6 6 6 G 9 9 9 9 9 H 8 8 6 8 8 I 3 3 f 3 3 K 7 7 7 3 7 L 4 4 8 v 1 M 3 3 5 5 5 N 1 1 1 1 1 O 8 8 8 8 8 P 4 4 4 4 4 Q 9 9 9 9 9 R 1 1 1 1 1 S 3 3 3 3 3 T 44 44 xx zz 44 52 Foreign Keys* • A key of a relation uniquely identifies each data item. • A foreign key references data items in other relations. – It can reference a primary or a candidate key in the other relation. • For example, in the Sample Relations handout, what are the foreign keys? 11/7/2015 53 LO1.2: Finding Foreign Keys • Circle the foreign keys in these relations. Use an arrow to point to the key that each foreign key references. Teacher (SSN, Name, Office, E-mail) Course (Department, Number, Name, Description) Class-Offering (Quarter, CRN, Section, Teacher, TimeDays) Student (StudentID, Name, Major, Advisor) Completed (StudentID, CRN, Quarter, Section, Grade) 11/7/2015 54 A Broken Promise • A table is something stored in a DBMS. It is the same as a relation except it need not have a key and it can have an order. • Peek at the Databasics Anonymous 7-step method on the next page. • We defined the relational model so that we could transform an ER diagram into a schema of relations. • There is just one catch. Codd's relational model requires that each relation have a key, and some entity sets, and some tables in DBMSs, don't. So in practice we skip a step and translate ER diagrams into a schema of tables. 11/7/2015 55 Transform an ER Diagram to a Schema of Tables • Recall the Databasics Anonymous 7-step method: Done 1. Organizing structured data into an ER Diagram, Ch. 2 Next 2. Transforming an ER diagram into a Schema of Tables, Ch. 3 3. Eliminating anomalies from those tables (normalization), Ch. 21 4. Structuring those tables efficiently (physical design), Ch. 22 5. Managing those tables using the intergalactic standard language (SQL), Ch. 5 • The DBMS manages the tables internally using Relational Algebra, Ch. 4 6. Protecting those tables during concurrent use (ACID properties), Ch. 16 7. Accessing those tables through a web-based interface (some scripting language) • We’ve completed step 1, and learned about the relational model so we could perform step 2. now we begin step 2 11/7/2015 56 Step 1: Translate each entity set into a table. Introduce a primary key if you wish* Candidate CandName Party Address Principal Associated Committee CommName Address Donated to Donor Occupation 11/7/2015 57 2. Translate each many to many relationship set into a table.* Committee CommID Name Address • 1..* DonatedTo 0..* Donor ID Occupation Amount We seek a table that represents the relationship set DonatedTo. Try to think of its name, attributes and primary key. Committee( CommID, Name, Address) Donor( ID, Name, Occupation, Address) 11/7/2015 58 3. What about One to Many Rel. Sets?* Candidate CandID Name Party Address 0..1 Associated 0..* Committee CommID Name Address Candidate( CandID, Name, Party, Address) Committee( CommID, Name, Address) • How can we represent the Associated Relationship set? • Replace the previous Committee table with • Or add the relation • Which alternative do you prefer? We use the first. 11/7/2015 59 Will the Real FEC database please stand up? • We have analyzed the raw FEC data using the ER diagram technique and concluded, correctly, that we should model it with Candidate, Committee, Donor and DonatedTo tables. • Now let's consider the real FEC database, the one stored on the web by the FEC. It's quite different. Donor data is not stored separately. The FEC stores donor data (name, occupation, partial address) with each donation, because that is how each campaign committee collects and reports its data. • Later in the class, when we study normalization, we will see that this redundant storage of donor information can lead to anomalies and problems, but it happens in the real world. 11/7/2015 60 Summary: ER to Schema of Tables 1. Translate each entity set into a table, introducing a primary key if you wish. 2. Translate each many to many relationship set into a table with the primary keys of the participating entity sets as keys and the attributes as attributes. 3. Translate each one to many relationship set into a foreign key in the “many” table and add its attributes as attributes. 11/7/2015 61 LO1.3: Translate ER to Tables* Employee SSN Name 0..* Assigned 1..1 PercentTime 11/7/2015 0..* Manager 0..* Project Number Name Budget Since 62 Review, Preview • Remember that we are working with structured data. We will work with unstructured data, like documents and web pages, at the end of the course. • We’ve taken raw data, expressed it as an ER diagram, and transformed it into a schema of tables. • Our next step is to implement these tables in a software system, called a relational database management system (RDBMS, or DBMS). – Example DBMSs: Postgresql(we will use), MySQL (both open source), Access, Oracle, DB2, SQLServer • Next we will survey DBMSs and study in depth their query languages: relational algebra and SQL. 11/7/2015 63 Databasics Anonymous 7-step method Done 1. Organizing structured data into an ER Diagram, Ch. 2 2. Transforming an ER diagram into a Schema of Tables, Ch. 3 3. Eliminating anomalies from those tables (normalization), Ch. 21 4. Structuring those tables efficiently (physical design), Ch. 22 Next 5. Managing those tables using the intergalactic standard language (SQL), Ch. 5 – The DBMS manages the tables internally using Relational Algebra, Ch. 4 6. Protecting those tables during concurrent use (ACID properties), Ch. 16 7. Accessing those tables through a web-based interface (some scripting language) 11/7/2015 64 What is a Database? DBMS? • Database (DB) - a collection of structured persistent data, typically >> physical memory – Persistent: exists after its creator terminates • Database management system (DBMS) - a software system that supports the definition, population, querying*, updating and management of a database. DBMS DB *The term “query” typically refers to read, as opposed to update, statements 11/7/2015 65 What is a DBMS, really? • We said a DBMS supports the definition, etc., of a database. So why isn’t Excel a DBMS? • True support means: – A sound theoretical model, the relational model, supporting the data’s structure and access. – An efficient (fast) and relatively simple (declarative) language, SQL, for retrieval of disk-based data. – Disk-based data structures to speed up data retrieval. – Logical and physical independence of programs from data – Safe access to data by multiple users – Recovery from crashes – An extensive security system – Support for parallel and distributed access to data – Support for access to nonstandard types of data • blobs, geometric data, video, etc. 11/7/2015 66 Cast of Characters • Database vendor developers • Database application developers • Database administrators (DBAs) – – – – – Designs & loads the database Security, Authorization Crash Recovery Tuning Help…. 11/7/2015 67 (R)DBMS Languages • We will learn two DBMS languages • The first, relational algebra, was originally defined by Codd as a theoretical language to accompany the relational model, to query relations. It is also the language used internally by a DBMS to query databases. • The second, SQL, is used by programmers to define, populate, query, update and manage databases. 11/7/2015 68 (R)DBMS Architecture* Web Form Applic. Front end SQL interface SQL Security Parser Relational Algebra(RA) Catalog Optimizer Executable Plan (RA+Algorithms) Concurrency Crash Recovery Plan Executor Files, Indexes & Access Methods Database, Indexes 11/7/2015 69 Relations vs. Tables • Recall that a relation is described by a schema such as Candidate(Candid,Name,…). • Relation is a theoretical concept. • Every relation must have a key. • Table is a concept defined in SQL. It is an object stored by a DBMS. It is sometimes called a SQL relation. • A relational database consists of tables • A table has exactly the same properties as a relation except that it need not have a key. A table is a multiset, not a set. 11/7/2015 70 Relations vs. Tables, ctd.* • Recipients, below, is the table (note it does not have a key) of committees that received donations from zip code 97223 in March. • SQL operates on tables only. • Relational algebra operates on relations, in the relational model, and on tables, in a DBMS. • We will study relational algebra operating on tables. Recipients commid C00383554 C00383554 C00383554 C00383554 C00430470 C00431445 C00431445 C00431445 C00437277 C00437277 C00437277 11/7/2015 71 The FEC Database • We're going to learn SQL and relational algebra using the (real) FEC database – See http://www.fec.gov/finance/disclosure/ftpdet.shtml#a2007_2008 • We've stored the four significant tables in a Postgresql DBMS Table Name Rows cand 3833 comm 10287 indiv 928956 pas 172675 Meg Table Meaning 1 3 157 13 Candidates Committees Donations from Individuals to Committees Donations from Committees to Candidates • We gave them short names so you would not have to type so much, and lower case names because Postgresql (PG) does not handle uppercase names well. • The PG database also includes "or" versions (for homework) and "cl" versions (for class) of each table. 11/7/2015 72 Postgresql and the FEC data* • You can access the FEC database with the user name cs386 and the password introdb. You have a choice of two interfaces to the Postgresql FEC DBMS. 1. There is a gui interface at https://dbclass.cs.pdx.edu/phppgadmin 2. and a command line interface from any psu unix machine: $ addpkg # choose pgsql-8. Do this just once $ psql -U cs386 -h dbclass.cs.pdx.edu cs386 • Details are in the notes view. • Postgres documentation is at http://www.postgresql.org/docs/8.3/interactive/ 11/7/2015 73 The CL tables* • In class we will use excerpts from the full database, with the suffix cl to signify "in class". • See the handout and below. CL Tables Document in http://www.cs.pdx.edu/~len/386/fec/Mats.doc 11/7/2015 74 History of SQL • SQL (pronounced EssQueEll or Sequel) was part of System R, now DB2. – Everyone agrees that Ingres’ language, Quel, was technically superior, but IBM’s marketing won the day. • SQL is maintained by ANSI & ISO. The major standards are – 1992 (SQL2) – 1999 (SQL3), added regular expression matching and triggers, recursive queries, procedural and control-of-flow statements, nonscalar types, and some object-oriented features – 2003,6 added XML related features, window functions, standardized sequences & auto-generated values. – See the syllabus for copies of the SQL2,3 standards. • DBMSs typically support almost all of the SQL2 & 3 standards but differ in a few crucial areas, e.g., join syntaxes and stored procedures. • Each Postgresql SQL documentation page tells, at the bottom, if any part of the statement is nonstandard. 11/7/2015 75 NULL Values & Primary Keys in SQL • SQL supports a special value called NULL, meaning that the value is – Unknown, e.g., age of someone unwilling to specify, or – Inapplicable, e.g., spouse of a single employee • Declaring a field to be a primary key in SQL means that it is unique (key in the usual sense) and contains no NULL values. – Note that this differs from the meaning of key in the ER and relational models. – It makes sense because a primary key is the principal means to locate an item and a NULL value would not make sense. – Foreign keys can have NULL values. • Thus when creating a table, declare an attribute to be NOT NULL if it will be a primary key. 11/7/2015 76 DDL for FEC Database • The DDL statements for the FEC database exist in three files: 0. u2uc – not DDL. A shell script to format the FEC files for copying into PostgreSQL 1. Tables.sql - CREATE TABLE statements 2. cs - A shell script that calls psql statements to bulk copy the data into any PostgreSQL database. 3. DDL.sql - Most of the DDL statements: ALTER TABLE, etc. • Don't try to understand them all at this point. FEC DDL Documents http://www.cs.pdx.edu/~len/386/fec/Mats.sql 11/7/2015 77 Why Separate CREATE TABLE? • The no-brainer way to create a database is 1. Issue all DDL statements: CREATE TABLE, create all structures such as primary keys, foreign keys, indexes. 2. Add all data, with INSERT or bulk copy statements. • • • For medium to large databases, this can take forever. As each row is added to the database, the DBMS will search the disk (horrors) to find its place in each structure. So in real life, code like that on page 68 of our text is rarely used. Instead, we 1. Issue all CREATE TABLE statements 2. Bulk copy all data into the database 3. Issue all DDL statements, using ALTER TABLE… • This allows the DBMS, in step 3, to process each structure efficiently. 11/7/2015 78 SQL’s Data Manipulation Language • SQL’s DML consists of these statements: – – – – INSERT INTO table VALUES … SELECT … FROM..[WHERE] UPDATE table SET column = … DELETE FROM table WHERE … • We will study the SELECT clause in detail. The syntax of the UPDATE and DELETE clauses are similar, and all of them can be found in the documentation. • The acronym CRUD (Create(INSERT), Read(SELECT), Update, Delete) is often used to refer to the four basic operations on persistent data. When you have implemented CRUD, your application is complete. 11/7/2015 79 Querying a Database With SQL • SQL’s SELECT statement retrieves data • What are the occupations of donors in indivcl who gave > $1000? SELECT occup FROM indivcl 3 retain this attribute 1 consider rows in this table WHERE amount > 1000; 2 choose rows with this condition • The answer occup PACIFIC CREST/PRINCIPLE PACIFIC CREST/PRINCIPLE SELF-EMPLOYED Notice that SQL's clauses are "out of order". One of SQL's primary problems. 11/7/2015 80 Practice* • Describe in SQL : What are the occupations of donors who made donations in March (the 3rd month)? • What does this query retrieve? SELECT occup FROM indivcl WHERE month = 2 AND month = 3; • Is there anything wrong with this query? 11/7/2015 81 Eliminating Duplicates in SQL • The answer to the previous query contained two copies of the occupation "PACIFIC CREST/PRINCIPLE". • How can we eliminate such duplicates? SELECT DISTINCT occup FROM indivcl WHERE month = 3; 11/7/2015 82 ORDER BY • Display donations in order of month, most recent first, and then in order of donation, smallest first. SELECT * FROM indivcl ORDER BY month desc, amount ASC; • ASC is the default. • SELECT * means retrieve all attributes 11/7/2015 83 LIKE Predicate • Wildcards in SQL are different than regular expressions in UNIX. • %: 0 or more characters, _: One character • List the donations of donors who are retired or selfemployed. SELECT * FROM indivcl WHERE LOWER(occup) LIKE '%retire%' OR LOWER(occup) LIKE '%self%'; • NOT LIKE is also possible • SQL3 includes many string manipulation functions such as LOWER( ) – http://web.cecs.pdx.edu/~len/sql1999.pdf, pg. 14 11/7/2015 84 Structure of SQL • SQL includes two major sections, DDL and DML • DDL, the Data Definition Language, defines the schema of the database. • CREATE statements create the schema. • ALTER statements can change the schema. • DROP statements can remove parts of the schema. • See the Postgresql documentation for detailed syntax of these statements. 11/7/2015 85 Exercises* 1. Draw an ER diagram to keep track of votes taken in the US House of Representatives during the current 2-year congressional session. [Note that the house, with 435 members, is separate from the Senate.] Show all cardinalities and primary keys. • • • • • Keep track of each congressperson ( unique name, district number, party). Keep track of the parties (Republican, Democrat or Independent) and for each of the two major parties, keep track of the name of the party’s (majority or minority) leader. For each bill, keep track of its ID, the date it was first proposed, the date each sponsor signed on to sponsor it and its status. Record how each congressperson voted on each bill (yes, no, abstain or absent), assuming there was only one vote (unrealistic). Record cardinalities and primary keys. 11/7/2015 86 Exercises, ctd.* 2. Identify, with circles and arrows, all the foreign keys in this schema. movies( title, year, genre, studio, producer) movie_star( stage_name, ssn, address, birthday ) stars_in( name, movie, year) studio_info( name, address, president ) producer_data(name, address) • • Assume no two movies in one year have the same title. Assume no two movie stars have the same stage name. 11/7/2015 87 Exercises, ctd.* 3. Describe a schema of tables for this ER diagram from a school where courses are team-taught: Teacher Ssn Name Degree 1..* Belongs To 1..* 0..* Team Name Primary Area PerCt Favorite 1..1 Teaches Date Since 11/7/2015 1..* Course 1..1 Name Description 88