CS186: Introduction to Database Systems

Download Report

Transcript CS186: Introduction to Database Systems

EECS 647: Introduction to Database Systems

Instructor: Luke Huan Spring 2007

Queries for Today

 What is a database?  What is a database management system?

 Why take a database course?

 Who will teach?

 How to take the class?

 Preview of class contents 4/24/2020 Luke Huan Univ. of Kansas 2

What

: Database Systems Then

4/24/2020 Luke Huan Univ. of Kansas 3

What

: Database Systems Today

4/24/2020 Luke Huan Univ. of Kansas 4

What

: Database Systems Today

4/24/2020 Luke Huan Univ. of Kansas 5

What

: Database Systems Today

4/24/2020 Luke Huan Univ. of Kansas 6

So…

What

is a Database?

    A

database

is a very large, integrated collection of data.

Data is a group of facts that can be recorded and have an implicit meaning.

Typically a database is used to model a real-world “enterprise” (or a

miniworld

)   Entities (e.g.,

basketball teams

,

games

) Relationships week) (e.g.

KU’s basketball team

won last Might surprise you how flexible this is   Web search:  Entities: words, documents  Relationships: word in document, document links to document.

P2P filesharing:   Entities: words, filenames, hosts Relationships: word in filename, file available at host 4/24/2020 Luke Huan Univ. of Kansas 7

Main Characteristics of Databases

     Self-describing nature of a database system  A DBMS catalog stores the

description

description is called meta-data . of the database. The Insulation between programs and data  Allows changing data storage structures and operations without having to change the DBMS access Data Abstraction  Use

data model

to hide storage details and present the users with a

conceptual view

of the database Support of multiple views of the data  Each user may see a different view of the database, which describes

only

the data of interest to that user.

Sharing of data and multi-user transaction processing 4/24/2020 Luke Huan Univ. of Kansas 8

Databases make these folks happy ...

 End users in

many

fields  Business, education, science, …  DB application programmers  Build data entry & analysis tools on top of DBMSs  Build web services that run off DBMSs  Database administrators (DBAs)  Design logical/physical schemas    Handle security and authorization Data availability, crash recovery Database tuning as needs evolve  DBMS vendors, programmers  Oracle, IBM, MS … 4/24/2020 9

What

is a Database Management System?

 A Database Management System (DBMS) is a collection of programs that enable uses to create and maintain databases  store, manage, and access data in a databases.

 Typically this term is used narrowly  Relational databases with transactions  E.g. Oracle, DB2, SQL Server   Mostly because they predate other large repositories  Also because of technical richness When we say DBMS follow this convention in this class we will usually  But keep an open mind about applying the ideas!

4/24/2020 Luke Huan Univ. of Kansas 10

What : Is the WWW a DBMS?

 Fairly sophisticated search available  Crawler indexes pages on the web  Keyword-based search for pages  But, currently   data is mostly unstructured and untyped search only:   can’t modify the data can’t get summaries, complex combinations of data   few guarantees provided for freshness of data, consistency across data items, fault tolerance, … web sites typically have a (relational) DBMS in the background to provide these functions.

4/24/2020 Luke Huan Univ. of Kansas 11

What : Is the WWW a DBMS?

 The picture is changing quickly  Information Extraction to get structure from unstructured  New standards e.g., XML, Semantic Web can help data modeling 4/24/2020 Luke Huan Univ. of Kansas 12

What : “Search” vs. Query

 What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

 Try “actors donated to john kerry” in your favorite search engine.

 If it isn’t “published”, it can’t be searched!

4/24/2020 Luke Huan Univ. of Kansas 13

What: A “Database Query” Approach

4/24/2020 Luke Huan Univ. of Kansas 14

What : Is a File System a DBMS?

 Thought Experiment 1:

A

•   

Q: How do you write

You both save it at the same time.

programs over a subsystem when it

D ) Neither

promises you only “???” ?

E ) ???

–The power goes out.

–Which changes survive?

A ) All B ) None C ) All Since Last Save D ) ???

4/24/2020 Luke Huan Univ. of Kansas 15

OS Support for Data Management

 Data can be stored in RAM  this is what every programming language offers!

  RAM is fast, and random access Isn’t this heaven?

 Every OS includes a File System  manages

files

on a magnetic disk  allows

open, read, seek, close

on a file  allows protections to be set on a file  drawbacks relative to RAM?

4/24/2020 Luke Huan Univ. of Kansas 16

Database Management Systems

 What more could we want than a file system?

 Simple, efficient

ad hoc

1 queries  concurrency control  recovery   benefits of good data modeling S.M.O.P.

2 ? Not really…  as we’ll see this semester  in fact, the OS often gets in the way!

1 ad hoc: formed or used for specific or immediate problems or needs 2 SMOP : Small Matter Of Programming 4/24/2020 Luke Huan Univ. of Kansas 17

Current Commercial Outlook

     A major part of the software industry:  Oracle, IBM, Microsoft  also Sybase, Informix (now IBM), Teradata  smaller players: java-based dbms, devices, OO, … Lots of related industries  data warehouse, document management, storage, backup, reporting, business intelligence, ERP, CRM, app integration Traditional Relational DBMS products dominant and evolving   adapted for extensibility (user-defined types), native XML support.

Microsoft merger of file system/DB…?

Open Source coming on strong  MySQL, PostgreSQL, Apache Derby, BerkeleyDB, Ingres, EigenBase And of course, the other “database” technologies  Search engines, P2P, etc.

4/24/2020 Luke Huan Univ. of Kansas 18

Advantages of a DBMS: a short list

       Controlling redundancy Restrict unauthorized access Providing persistent storage for program objects Providing storage structure for efficient query processing Providing backup and crash recovery ….

And many many others that are going to be explored in this class 19 4/24/2020 Luke Huan Univ. of Kansas

What

database systems will we cover?

  We will be try to be broad and touch upon    Relational DBMS (e.g. Oracle, SQL Server, DB2, Postgres) “Semi-structured” DB systems (e.g. XML repositories like Xindice) Data mining: transfer data into knowledge!

Starting point    We assume you have used web search engines We assume you don’t know relational databases  Yet they pioneered many of the key ideas So focus will be on relational DBMSs  With frequent side-notes on search engines, XML issues 4/24/2020 Luke Huan Univ. of Kansas 20

A.

B.

C.

D.

E.

Why take this class?

Database systems are at the core of CS They are incredibly important to society The topic is intellectually rich It isn’t that much work Looks good on your resume Let’s spend a little time on each of these 4/24/2020 Luke Huan Univ. of Kansas 21

Why

take this class?

A. Database systems are the core of CS

  Shift from computation to information    True in corporate computing for years Web, p2p made this clear for personal computing Increasingly true of scientific computing Need for DB technology has exploded in the last years   Corporate : retail swipe/clickstreams, “customer relationship mgmt”, “supply chain mgmt”, “data warehouses”, etc.

Web :not just “documents”. Search engines, e-commerce, blogs, wikis, other “web services”.

  Scientific : digital libraries, genomics, satellite imagery, physical sensors, simulation data Personal : Music, photo, & video libraries. Email archives. File contents (“desktop search”). 4/24/2020 Luke Huan Univ. of Kansas 22

Why take this class?

B. DBs are incredibly important to society

 “Knowledge is power.” -- Sir Francis Bacon  “With great power comes great responsibility.” -- SpiderMan’s Uncle Ben Policy-makers should understand technological possibilities.

Informed Technologists needed in public discourse on usage.

4/24/2020 Luke Huan Univ. of Kansas 23

Why

take this class?

C. The topic is intellectually rich.

     representing information  data modeling languages and systems for querying data   complex queries & query semantics* over massive data sets concurrency control for data manipulation   controlling concurrent access ensuring transactional semantics reliable data storage  maintain data semantics even if you pull the plug data mining  Let your data speak * semantics: the meaning or relationship of meanings of a sign or set of signs 4/24/2020 Luke Huan Univ. of Kansas 24

Why

take this class?

D. It isn’t that much work.

 Bad news: It is a lot of work.

 Good news: the course is front loaded  Most of the hard work is in the first half of the semester  Load balanced with most other classes 4/24/2020 Luke Huan Univ. of Kansas 25

Why take this class?

E. Looks good on my resume.

 Yes, but why? This is not a course for:   Oracle administrators   IBM DB2 engine developers Though it’s useful for both!

It is a course for well-educated computer scientists   Database system concepts and techniques increasingly used “outside the box”  Ask your friends at Microsoft, Yahoo!, Google, Apple, etc.

 Actually, they may or may not realize it!

A rich understanding of these issues is a basic and (un?)fortunately unusual skill.

4/24/2020 Luke Huan Univ. of Kansas 26

Who?

 Instructors  Prof. Luke Huan, EECS  [email protected]

 Prof. Office Hours:  Eaton Hall 2034, M, W 4:15-5:15pm 4/24/2020 Luke Huan Univ. of Kansas 27

How?

Workload

      Projects with a “real world” focus: Build a web-based application w/PostgreSQL and your favorite internet programming languge (PHP, JAVA, PERL, just naming a few) Homework assignments and quizzes Exams – 1 Midterm & 1 Final Projects to be done in groups of 2  Pick your partner ASAP The course is “front-loaded”  most of the hard work is in the first half 4/24/2020 Luke Huan Univ. of Kansas 28

How

? Administrative

 Text book: Elmasri & Navatgh 5 th edition  Class website - people.eecs.ku.edu/~jhuan/EECS647  read it regularly  Please include eecs 647 in your mail to the instructor (for quick response)  Grading, hand-in policies, etc. is on syllabus * The textbook will be available at the KU bookstore today or tomorrow 4/24/2020 Luke Huan Univ. of Kansas 29

Academic Misconduct

   We take academic misconduct

very

seriously.

We encourage student to work together for doing homework and projects, but each student should write down their

own

solution.   You are absolutely encouraged to discuss with your classmates about your homework assignments, programming exercises, and final projects You are responsible for all your works For homework assignments and projects resulted from discussion, please always acknowledge other people’s contribution by including a sentence at the beginning of the hand-in saying “I discussed the homework (project) with XXX (include more names if necessary)”. There is absolutely NO penalty for doing so .

4/24/2020 Luke Huan Univ. of Kansas 30

Agenda for the rest of today

 Today: a preview of the class contents:   ER model Logical database design   Relational model SQL   File system Transaction  Query processing  Data mining  XML  Database in the future  This Wednesday  the Entity-Relationship model   Today’s lecture is from Chapter 1 and 2 in E & N Read Chapter 3 for next class.

4/24/2020 Luke Huan Univ. of Kansas 31

Agenda …

 Design a Database  ER model  Relational model  Database logical design  Query with SQL 4/24/2020 Luke Huan Univ. of Kansas 32

Data Models Describe Data

 A

data model

is a collection of concepts for describing data.

 Different levels of data models  Conceptual (semantic) data model uses data (facts) to describe a physical world  Representational model describes data in database management systems  Physical data model describe how data is stored in files in computers. 4/24/2020 Luke Huan Univ. of Kansas 33

An ER Model for a mini-World

ssn name Salary Employees number name manager Works_For Departments

Entity:

Real-world object distinguishable from other objects. An entity is described (in DB) using a set of attributes.

Relationship

: Association among two or more entities. E.g., John works in Pharmacy department.

Relational Model Describes Data

Relational model

is the most popular representational data model that are used by current commercial DBMS.

 Relational model describes data using tables SSN 123456789 333444555 453453453 Name John Smith Franklin Wong Joyce English Salary 30000 40000 25000

ssn

  Cardinality = 3, degree = 3, all rows distinct Do all columns in a relation instance have to be distinct?

name Employees Salary

4/24/2020 Luke Huan Univ. of Kansas 35

Logical Database Design

SSN 123456789 453453453

ssn name Salary Employees number name manager Works_For Departments

Name John Smith 333444555 Franklin Wong Joyce English Salary 30000 40000 25000 DNumber 5 DName research Manager 123456789 SSN 123456789 333444555 453453453 Dnum 5 5 5 4/24/2020 Luke Huan Univ. of Kansas 36

Logical Database Design (Cont.)

ssn name Salary Employees number name manager Works_For Departments

SSN 123456789 Name John Smith 333444555 Franklin Wong 453453453 Joyce English Salary 30000 40000 25000 DNum 5 5 5 DNumber 5 DName research Manager 123456789 4/24/2020 Luke Huan Univ. of Kansas 37

Logical Database Design (Cont.)

ssn name Salary Employees number name manager Works_For Departments

SSN 123456789 333444555 453453453 Name John Smith Franklin Wong Joyce English Salary 30000 40000 25000 DNumber 5 5 DName research research Manager 123456789 123456789 5 research 123456789 4/24/2020 Luke Huan Univ. of Kansas 38

SQL: Query Information from a Database

 Query 1: Retrieve the salary of Franklin Wong

Q1: SELECT FROM SALARY EMPLOYEE WHERE NAME=‘

Franklin Wong

Answer: 40000 SSN Name 123456789 333444555 Franklin Wong 453453453 John Smith Joyce English Salary 30000 40000 25000 4/24/2020 Luke Huan Univ. of Kansas 39

Agenda …

 Design a Database Management System    File systems  Disk storage  Indexing Query processing Transactions  Concurrency  Recovery 4/24/2020 Luke Huan Univ. of Kansas 40

Components of a Disk

Spindle Tracks  Disk head The platters spin (say, 90rps).

 a The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make

cylinder

(imaginary!).

Arm movement Sector Platters Arm assembly

Indexing is the Key to Speed up Disk Operations

 ``

Find all employees whose salary > $25000

’’  If data is in sorted file, do binary search to find first such employee, then scan to find others.

  Cost of binary search can be quite high.

Simple idea: Create an `index’ file.

Index File k1 k2 kN Page 1 Page 2 Page 3 Page N

Can do binary search on (smaller) index file!

Data File

Concurrent execution of user programs

 Why?

 Utilize CPU while waiting for disk I/O  (database programs make heavy use of disk)  Avoid short programs waiting behind long ones  e.g. ATM withdrawal while bank manager sums balance across all accounts 4/24/2020 Luke Huan Univ. of Kansas 43

Key concept:

Transaction

 an atomic sequence of database actions (reads/writes)  takes DB from one consistent state to another consistent state 1 transaction consistent state 2 4/24/2020 Luke Huan Univ. of Kansas 44

Example

checking: $200 savings: $1000 Transaction “transfer $100 from Saving to checking” checking: $300 savings: $900    Here,

consistency

“semantics” is based on our knowledge of banking In general, up to writer of transaction to ensure transaction preserves consistency DBMS provides (limited) automatic enforcement, via integrity constraints  e.g., balances must be >= 0 4/24/2020 Luke Huan Univ. of Kansas 45

Agenda …

 Advanced topics    XML Data mining  Association  Classification  Clustering Database in the future 4/24/2020 Luke Huan Univ. of Kansas 46