Query Processing, Resource Management and Approximate in a


DATA

An emerging idea in data mining is to vertically structure data and to process it across those vertical structures. This is in contrast to the traditional method of structuring data into horizontal structures and processing down those horizontal structures. (The horizontal structures are called records, e.g., an employee file containing horizontal employee records, likely made up of fields such as Name, Address, Salary, Phone, etc.) Again, Horizontal Processing of Vertical Data (HPVD) is an alternative to the traditional Vertical Processing of Horizontal Data (VPHD). Do we need to structure and process data differently than we have in the past? If so, why? What has changed? Digital data has gotten really BIG, and processing speed and power have not kept pace. Data has gotten very wide in some cases (in terms of the number of columns, attributes or fields, the width of the file, table or dataset), but the main explosion has been in the number of instances of the data (the number of rows, tuples or records, the depth of the file, table or data set).

Thus the vertical processing of data in traditional VPHD can take forever!

How big (deep) is BIG DATA these days, and how big will it get?

Example: The US Library of Congress is storing EVERY tweet sent since Twitter launched in 2006. Each tweet record contains 50 fields (not really terribly wide), but there are billions of them (very deep). Let's assume each tweet record is about 1000 bits wide (so, on average, 20 bits per field), and let's estimate approximately 1 trillion tweets from 1 billion tweeters to 1 billion tweetees over 10 years of tweeting. As a full 3-dimensional dataset, that's 10^30 data items (10^12 * 10^9 * 10^9).

That's BIG! Is it going to get even bigger? Yes.
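As a quick sanity check, the estimate can be reproduced in a few lines (my own back-of-envelope arithmetic, using the figures assumed in the text):

```python
# Back-of-envelope check of the Library of Congress tweet estimates,
# using the lecture's assumed figures.
tweets, tweeters, tweetees = 10**12, 10**9, 10**9
bits_per_tweet = 1000

# Stored flat, the tweets themselves are big but manageable:
flat_bytes = tweets * bits_per_tweet // 8
print(flat_bytes)            # 125000000000000 bytes (~125 terabytes)

# Fully materialized as a 3-D (tweet x tweeter x tweetee) cube, it explodes:
cells = tweets * tweeters * tweetees
print(cells == 10**30)       # True: 10^30 data items
```

The point of the contrast: the raw tweets are "only" terabytes, but treating the data as a full multidimensional structure is what drives the count to 10^30.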

Let's look at how the definition of big data has evolved just over my work lifetime. My first CS job was as THE technician at the St. John's Univ IBM 1620 Computer Center.

1. I turned the 1620 switch on.
2. I waited for the ready light bulb to come on (~15 minutes).
3. I put the Op/Sys punch card stack on the card reader (~4 inch high stack of cards).
4. I put the FORTRAN compiler card stack on the reader (~3 inch high stack).
5. I put the FORTRAN program card stack on the reader (~2 inch high stack).
6. The 1620 produced an object code stack, which I read in (~1 inch high stack).
7. I read in the object stack and a 1964 BIG DATA stack (~40 inch high stack).

How high would the Big Data STACK reach today if it were put on punch cards?

Let's be conservative and assume an exabyte (2^60 bytes) of data on cards. So how high is an exabyte punch card stack? Take a guess. We're being conservative because the US LoC tweet DB may be > ~10^30 B (if it's fully and losslessly stored in all dimensions: tweet, tweeter, tweetee, hashtag, ...).

That exabyte stack of punch cards would reach to JUPITER!
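That claim checks out on the back of an envelope (a sketch using my own assumed card parameters, which are not from the lecture: 80 bytes per card and 0.007 inches of card thickness):

```python
# Rough check of the exabyte-to-Jupiter claim.
# Assumptions (mine, not the lecture's): an 80-column card holds ~80 bytes,
# and one card is ~0.007 inches thick.
BYTES_PER_CARD = 80
CARD_THICKNESS_M = 0.007 * 0.0254      # 0.007 inches, in meters

exabyte = 2**60                        # bytes
cards = exabyte / BYTES_PER_CARD
stack_m = cards * CARD_THICKNESS_M

earth_jupiter_m = 6.3e11               # rough average Earth-Jupiter distance
print(stack_m / earth_jupiter_m)       # ~4: the stack overshoots Jupiter
```

So under these assumptions the stack is a few times the Earth-Jupiter distance, which is consistent with the slide's claim.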

So, in my work lifetime, BIG DATA has gone from a 40 inch high stack of cards to one that reaches all the way to Jupiter! What will happen to BIG DATA over your work lifetime? I don't know, but I predict that any estimate you make will be way low.

I must deal with the LoC Tweet Dataset, which would reach Jupiter as a punch card stack. To do so, I find that I need to structure it losslessly as 1000 extendable vertical pTrees and write programs to process horizontally across those 1000 vertical structures.

You may have to deal with a Tweet Dataset that would reach the end of space (if on cards), but you will be able to use my horizontal processing code, or structure it losslessly as 1000 extendable vertical pTrees and write programs to process across those vertical structures horizontally, or deal with it effectively in some new way that you invent (there is always a better way to do anything!).

The next generation may have to deal with a data file that creates new space, but if we do our job as data mining researchers, they will be able to use my code (or your new code), or structure it losslessly as extendable vertical pTrees and process across them horizontally, or deal with it in some new way that they invent.

It seems clear to me that BIG DATA WILL HAVE TO BE COMPRESSED and that data will have to be VERTICALLY structured. Let's take a quick look at how one might organize and compress vertical data (and keep in mind that, so far, we have found that the compression step is unnecessary because processors are so fast at bit AND/OR/COMPLEMENT operations these days).

A Vertical Data Structuring

Traditional Vertical Processing of Horizontal Data (VPHD), e.g., find the number of occurrences of (7,0,1,4) in R(A1, A2, A3, A4): for horizontally structured, record-oriented data, one scans vertically down the records, column by column (4 vertical structures).

R(A1 A2 A3 A4), base 10:
2 7 6 1
3 7 6 0
2 6 5 1
2 7 5 7
3 2 1 4
2 2 1 5
7 0 1 4
7 0 1 4

(7,0,1,4) occurs 2 times.

To find the number of occurrences using vertical pTrees instead, convert to base 2, vertically slice off each bit position (12 vertical structures: 3 bit slices for each of the 4 attributes), then compress each bit slice into a tree using a predicate. These are predicate Trees = pTrees. (We will walk through the compression of R11 into pTree P11.)

R[A1] R[A2] R[A3] R[A4], base 2:
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
011 010 001 100
010 010 001 101
111 000 001 100
111 000 001 100

Imagine an excillion records, not just 8 (we need speed!).

For each bit slice R11, R12, R13, R21, R22, R23, R31, R32, R33, R41, R42, R43, record the truth of the predicate "purely 1-bits" in a tree, recursively on halves, until the half is pure. The leaves need not be single bits (e.g., they can be strides of 64 or more bits).

Walking through the compression of R11 = 0 0 0 0 0 0 1 1 into P11:
1. Whole thing pure1? false, so 0.
2. Left half pure1? false, so 0 (and it is pure0, so this branch ends).
3. Right half pure1? false, so 0.
4. Left half of right half pure1? false, so 0 (pure0, so this branch ends).
5. Right half of right half pure1? true, so 1.

To count occurrences of (7,0,1,4), note that 7 0 1 4 = 111 000 001 100 in binary, and AND the pTrees, using the complement P' wherever the target bit is 0:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The result has 1-bits exactly at the rows equal to (7,0,1,4); summing the pure1 quadrant counts gives 0*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 2.

(So far, we have found that the step of compressing the vertical bit strips or slices into tree structures (pTrees) is unnecessary because processors are so fast at bit AND/OR/COMPLEMENT operations.)
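The whole procedure fits in a short sketch (plain Python illustrating the idea, not the actual pTree implementation; all names here are invented):

```python
# A sketch of vertical bit slicing, pure1-predicate compression, and
# AND-based counting, on the lecture's 8-row example.
ROWS = [(2,7,6,1),(3,7,6,0),(2,6,5,1),(2,7,5,7),
        (3,2,1,4),(2,2,1,5),(7,0,1,4),(7,0,1,4)]
BITS = 3  # each attribute fits in 3 bits

def bit_slice(rows, attr, bit):
    """One vertical structure: the given bit of the given attribute, per row."""
    return [(r[attr] >> bit) & 1 for r in rows]

def ptree(bits):
    """Compress a bit slice with the 'purely 1-bits' predicate, recursively
    on halves; a branch ends as soon as it is pure (all 1s or all 0s)."""
    if all(bits):
        return 1
    if not any(bits) or len(bits) == 1:
        return 0
    mid = len(bits) // 2
    return (0, ptree(bits[:mid]), ptree(bits[mid:]))

def count_tuple(rows, target):
    """Count occurrences of target by ANDing the 12 bit slices,
    complementing a slice wherever the target's bit is 0."""
    mask = [1] * len(rows)
    for attr, val in enumerate(target):
        for bit in range(BITS):
            s = bit_slice(rows, attr, bit)
            want = (val >> bit) & 1
            mask = [m & (b if want else 1 - b) for m, b in zip(mask, s)]
    return sum(mask)

# R11 is the high-order bit slice of A1; compressing it reproduces P11:
print(ptree(bit_slice(ROWS, 0, 2)))     # (0, 0, (0, 0, 1))
print(count_tuple(ROWS, (7, 0, 1, 4)))  # 2
```

The uncompressed `count_tuple` is the "compression step is unnecessary" variant: it ANDs the raw bit slices directly, which is exactly what fast bitwise hardware does well.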

ANSI schema model

Levels of Data Abstraction are defined by various schema levels: many views (View 1, View 2, View 3, ...), one conceptual (logical) schema, and one physical schema.

Views describe how users see the data (different data models for different views). The conceptual schema defines the logical structure of the entire data enterprise. The physical schema describes the underlying files and indexes used.

Schemas are defined using Data Definition Languages (DDLs); data are modified/queried using Data Manipulation Languages (DMLs).
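The DDL/DML split is easy to see in a tiny runnable example (a sketch using Python's built-in sqlite3 module; the table and values are invented for illustration):

```python
# DDL defines structure; DML manipulates and queries data within it.
import sqlite3

con = sqlite3.connect(":memory:")

# DDL: define the schema
con.execute("CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, gpa REAL)")

# DML: modify and query the data
con.execute("INSERT INTO Students VALUES ('53666', 'Jones', 3.4)")
rows = con.execute("SELECT name FROM Students WHERE gpa > 3.0").fetchall()
print(rows)  # [('Jones',)]
```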

Structure of a DBMS A typical DBMS has a layered architecture.

This is one of several possible architectures. Another with a little more detail on next slide.

QUERIES from users (or transactions, or user-workload requests) enter at the top in SQL (or some other user interface language) and pass down through the layers:

QUERY OPTIMIZATION LAYER: accepts SQL and produces plans of relational operators (Select, Project, Join).
DATABASE OPERATOR LAYER: implements the relational operators, calling file processing operators (open/close file, read/write record).
FILE MANAGER LAYER: provides the file concept, calling buffer management operators (read/flush page).
BUFFER MANAGER LAYER: manages pages in memory, calling disk transfer operators (malloc, read/write block).
DISK SPACE MANAGER LAYER: manages the DB on DISK.

(Summarized: Query Opt/Execution, Relational Operators, Files and Access Methods, Buffer Management, Disk Space Management; these are the DB layers, with concurrency control and recovery alongside.)

DISK SPACE MANAGER

Deals with space on disk. It offers an interface to higher layers (mainly the BUFFER MGR) consisting of: allocate/deallocate space; read/write block. It can be implemented on a raw disk system directly, in which case it would likely access data as: read block b of track t of cylinder c on disk d. Or it can use the OS file system (OS file = sequence of bytes), in which case it would likely access data as: read bytes b of file f, and the operating system file manager would translate that into: read block b of track t of cylinder c on disk d. Most systems do not use the OS file system, for portability reasons and to avoid OS file size peculiarities (limitations).

BUFFER MANAGER

Partitions the main memory allocated to the DBMS into buffer page frames, and brings pages to and from disk as requested by higher layers (mainly the FILE Mgr).

FILE MANAGER

Supports the file concept to higher layers (DBMS file = collection of records and pages of records) and supports access paths to the data in those files (e.g., indexes). Not all higher-level DBMS code recognizes/uses the page concept; almost all DBMSs use the record concept, though.

DATABASE OPERATOR LAYER

Implements physical data model operators (e.g., the relational operators select, project, join).

QUERY OPTIMIZER

Produces efficient execution plans for answering user queries (e.g., execution plans as trees of relational operators, select, project, join, union, intersect, translated from, e.g., SQL queries). SQL is not adequate to answer all user-database questions; e.g., knowledge workers working on Data Warehouses ask "what if" questions (On-Line Analytic Processing, or OLAP), not just retrieval questions (SQL).

Overview of Database Design

Conceptual design:
 What are the entities and relationships in the enterprise?
 What information about these entities and relationships should be stored in the database?
 What integrity constraints or business rules should be enforced?

A database schema model diagram answers these questions pictorially (Entity-Relationship or ER diagrams).
 Then one maps the ER diagrams into a relational schema (using the Data Definition Language provided).

Entity: a real-world object type distinguishable from other object types. An entity is described using a set of attributes, and each entity set has a key (the chosen identifier attribute(s), underlined in these notes). E.g., the Employee entity has attributes ssn, name, lot, with key ssn. Each attribute has a domain (its universe of allowable values).

Relationship: an association among two or more entities, e.g., employee Jones works in the Pharmacy department. Relationships can have attributes too, e.g., since on Works_In between Employee (ssn, name, lot) and Department (did, dname, budget).

One must specify the "role" of each entity to distinguish multiple participations. Works_In is a degree-2 relationship between the entities Employee and Department. Can there be a degree-2 relationship between an entity and itself? Yes, e.g., Employee Reports_To Employee, with role names supervisor and subordinate.

Relationship Cardinality Constraints

(many-to-many) Works_In: an employee can work in many departments; a dept can have many employees working in it.
(1-many) Manages: it may be required that each dept has at most 1 manager.
(1-1) Manages: in addition, it may be required that each manager manages at most 1 department.

(In the ER diagram, Employee (ssn, name, lot) connects to Department (did, dname, budget) through Works_In, labeled m and n with attribute since, and through Manages, labeled 1; the possible labelings are 1-to-1, 1-to-many, many-to-1, and many-to-many.)

Participation Constraints: must every department have a manager? If so, this is an example of a total participation constraint: the participation of Department in Manages is said to be total (vs. partial).

(In the diagram, Employee (ssn, name, lot) connects to Department through both Manages, with attribute since, and Works_In; the Manages line into Department is total.)

ISA (`is a’) Hierarchies

We can use attribute inheritance to avoid repeating shared attributes. If we declare an ISA relationship among entity types, e.g., A ISA B (every instance of entity A is also an instance of entity B), then A entities "inherit" B's attributes. E.g., every Hourly_Emp ISA Employee and every Contract_Emp ISA Employee, so both inherit ssn, name and lot; Hourly_Emps and Contract_Emps can also have their own separate attributes (e.g., hourly_wages and hours_worked for Hourly_Emp, contractid for Contract_Emp).

Overlap constraints: can Joe be both an Hourly_Emp and a Contract_Emp? (Allowed/disallowed; in the example, overlap is allowed.)

Covering constraints: does every Employee entity also have to be an Hourly_Emp or a Contract_Emp entity? (Yes/no; in the example, covering is yes.)

Relational database: a set of relations.

Relation: made up of 2 parts:
 Instance (or occurrence): a table, with rows and columns. #rows = cardinality, #fields = degree.
 Schema (or type): specifies the name of the relation plus the name and type of each attribute, e.g., Students(sid: string, name: string, login: string, age: integer, gpa: real).

Strictly, a relation is a set of tuples, but it is common to think of it as a table (a sequence of rows, each a sequence of attribute values).

A major strength of the relational model: it supports simple, powerful querying of data. Queries can be written intuitively (specifying what, not how); the DBMS is responsible for evaluation. The DBMS does your programming! This allows a module called the optimizer to extensively re-order operations (even combining similar operations from different concurrent requests) and still ensure that the answer does not change.

SQL is one of the simplest languages on earth (very English-like!). Specify what, not how.

• E.g., SELECT attributes FROM relations WHERE condition
(What columns you want; what rows you want.)

Find all 18 year old students (a selection):

SELECT * FROM Students S WHERE S.age=18

sid   name  login    age gpa
53666 Jones jones@cs 18  3.4
53688 Smith smith@ee 18  3.2

Find just names and logins (a projection); replace the 1st line with SELECT S.name, S.login:

SELECT S.name, S.login FROM Students S WHERE S.age=18

name  login
Jones jones@cs
Smith smith@ee
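Both queries can be run as-is against a toy copy of the table (a sketch using Python's built-in sqlite3, with the slide's sample rows):

```python
# Selection picks rows; projection picks columns.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Students (sid, name, login, age, gpa)")
con.executemany("INSERT INTO Students VALUES (?,?,?,?,?)",
                [(53666, 'Jones', 'jones@cs', 18, 3.4),
                 (53688, 'Smith', 'smith@ee', 18, 3.2)])

# Selection: all 18 year old students (whole rows)
selection = con.execute(
    "SELECT * FROM Students WHERE age=18 ORDER BY sid").fetchall()
print(selection)

# Projection: just the name and login columns
projection = con.execute(
    "SELECT name, login FROM Students WHERE age=18 ORDER BY sid").fetchall()
print(projection)  # [('Jones', 'jones@cs'), ('Smith', 'smith@ee')]
```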

Querying Multiple Relations (Join, implemented using nested loop, alternative 1)

• What does the following query produce?

SELECT S.name, E.cid
FROM Students S, Enrolled E
WHERE S.sid=E.sid AND E.grade='A'

(The WHERE clause is also used here to combine, i.e., join, S and E.)

Students:
sid   name  login    age gpa
53666 Jones jones@cs 18  3.4
53650 Smith smith@ee 18  3.2

Enrolled:
sid   cid         grade
53831 Carnatic101 C
53831 Reggae203   B
53650 Topology112 A
53666 History105  B

For the 53831 rows the join condition fails (no matching student); for (53666, History105, B) the join succeeds but the grade='A' selection fails. We get:

S.name E.cid
Smith  Topology112
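The same join runs verbatim in a toy setting (a sketch using Python's built-in sqlite3, with the tables above):

```python
# A join expressed in the FROM/WHERE style used on the slide.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Students (sid, name, login, age, gpa)")
con.execute("CREATE TABLE Enrolled (sid, cid, grade)")
con.executemany("INSERT INTO Students VALUES (?,?,?,?,?)",
                [(53666, 'Jones', 'jones@cs', 18, 3.4),
                 (53650, 'Smith', 'smith@ee', 18, 3.2)])
con.executemany("INSERT INTO Enrolled VALUES (?,?,?)",
                [(53831, 'Carnatic101', 'C'), (53831, 'Reggae203', 'B'),
                 (53650, 'Topology112', 'A'), (53666, 'History105', 'B')])

rows = con.execute("""SELECT S.name, E.cid
                      FROM Students S, Enrolled E
                      WHERE S.sid = E.sid AND E.grade = 'A'""").fetchall()
print(rows)  # [('Smith', 'Topology112')]
```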

Destroying and Altering Relations (also DDL)

DROP TABLE Students
Destroys the relation Students: the schema information and the tuples are deleted.

ALTER TABLE Students ADD COLUMN Year: integer
The schema of Students is altered by adding a new field; every current tuple is extended, e.g., with null in the new field.

Adding and Deleting Tuples

Insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa) VALUES (53688, ‘Smith’, ‘smith@ee’, 18, 3.2)

Delete all tuples satisfying some condition (e.g., name = Smith):
DELETE FROM Students S WHERE S.name = ‘Smith’

Many powerful variants of these commands are available!

A View is a relation constructed from stored (base) relations. We store a definition of it, rather than the instance (actual tuples):

CREATE VIEW YoungActiveStudents (name, grade) AS
SELECT S.name, E.grade
FROM Students S, Enrolled E
WHERE S.sid = E.sid AND S.age < 21

Views can be dropped using the DROP VIEW command. How should DROP TABLE be handled if there is a view on the table? The DROP TABLE command has options to let the user specify this. Views can be used to present necessary information (or a summary), while hiding details in the underlying relation(s).

Integrity Constraint (IC): a condition that must be true for any instance in the database, e.g., domain constraints. ICs are specified when (or after) relations are created, and checked when relations are modified. A legal instance of a relation is one that satisfies all its ICs. The DBMS should not allow illegal instances. This avoids data entry errors, too!

Primary Key Constraints: a set of fields is a key (strictly speaking, a candidate key) for a relation if it satisfies:
1. (Uniqueness condition) No two distinct tuples can have the same values in the key (which may be a composite).
2. (Minimality condition) The uniqueness condition is not true for any proper subset of a composite key.

If Part 2 is false, it is called a superkey (for superset of a key). There is always at least one key for a relation; one of the keys is chosen (by the DBA) to be the primary key, the primary record identification or lookup column(s). E.g., sid is a key for Students; the set {sid, gpa} is a superkey.

Foreign Keys and Referential Integrity: a foreign key is a field (or set of fields) in one relation used to `refer’ to a tuple in another relation (by listing the primary key value of that tuple); like a `logical pointer’. E.g., sid in Enrolled is a foreign key referring to sid in Students (sid is the primary key of Students). If all foreign key constraints are enforced, a special integrity constraint, referential integrity, is achieved, i.e., there are no dangling references. E.g., if referential integrity is enforced (and it almost always is), an Enrolled record cannot have a sid that is not present in Students (students cannot enroll in courses until they register in the school).

Enrolled:
sid   cid         grade
53666 Carnatic101 C
53666 Reggae203   B

Students:
sid   name  login
53666 Jones jones@cs
53688 Smith smith@eecs
53650 Smith smith@math

If a referenced Students tuple is deleted, the options include:
CASCADE (also delete all tuples that refer to the deleted tuple)
SET NULL / SET DEFAULT (sets the foreign key value of the referencing tuple)
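Referential integrity and the CASCADE option can be demonstrated in a few lines (a sketch using Python's built-in sqlite3; note that SQLite enforces foreign keys only when the pragma is switched on):

```python
# Referential integrity: dangling references are rejected, and
# ON DELETE CASCADE removes referencing tuples automatically.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
con.execute("CREATE TABLE Students (sid INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE Enrolled (
                 sid INTEGER REFERENCES Students(sid) ON DELETE CASCADE,
                 cid TEXT, grade TEXT)""")
con.execute("INSERT INTO Students VALUES (53666, 'Jones')")
con.execute("INSERT INTO Enrolled VALUES (53666, 'Carnatic101', 'C')")

# A dangling sid (no such student) is rejected:
try:
    con.execute("INSERT INTO Enrolled VALUES (99999, 'Reggae203', 'B')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)   # True

# Deleting the student cascades to the enrollment:
con.execute("DELETE FROM Students WHERE sid = 53666")
remaining = con.execute("SELECT count(*) FROM Enrolled").fetchone()[0]
print(remaining)  # 0
```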

ICs are based on the real-world enterprise semantics that are described in the DB. I.e., the users decide the semantics, not the DB experts! Why? We can check a DB instance to see if an IC is violated, but we can NEVER infer an IC by only looking at the data instances. An IC is a statement about all possible instances! It is not a statement that can be inferred from the set of currently existing instances. If ICs were inferred from current instances, then when a relation is newly created and has, say, just 2 tuples, many, many ICs would be inferred (e.g., from the table below, the system might infer that students MUST be 18, or that names have to be 5 characters, or, worse yet, that gpa ranking must be the same as alphabetical name ordering!).

sid   name  login    age gpa
53666 Jones jones@cs 18  3.4
53688 Smith smith@ee 18  3.5

Key and foreign key ICs are the most common.

Who decides the primary key, etc.? The database design expert? NO! Not in isolation, anyway. Someone from the enterprise who understands the data and the procedures should be consulted. The following story illustrates this point. CAST: Mr. Goodwrench = MG (parts manager); pointy-headed DB expert = PhD.

PhD: I've looked at your data, and decided Part Number (P#) will be designated the primary key for the relation PARTS(P#, COLOR, WT, TIME-OF-ARRIVAL).
MG: You're the expert.
. . . later
MG: Why is lookup so slow?
PhD: Well, according to what I've learned in school, P# should be the primary key, because IT IS the lookup attribute! You do store parts in the stock room ordered by P#, right?
MG: No. We store by weight! When a shipment comes in, I take each part into the back room and throw it as far as I can. The lighter ones go further than the heavy ones, so they get ordered by weight!
PhD: But, but... weight doesn't have the uniqueness property! Parts with the same weight end up together in a pile!
MG: No they don't. I tire quickly, so the first one I throw goes furthest.
PhD: Then we'll use a composite primary key, (weight, time-of-arrival).
MG: We get our keys primarily from Curt's Lock and Key.

Point: this conversation should have taken place during the 1st meeting.

An ER Example:

COMPANY is described to us as follows: Company is organized into depts - each with a name, number, manager. - Each manager has a startdate. - Each dept can have several locations. Departments control projects - each with a name, number, location. Each employee has a name, SSN, sex, address, salary, birthdate, dept, supervisor. - An employee may work on several projects (not necessarily all controlled by his dept) for which we keep hoursworked by project. Each employee dependent has a name, sex, birthdate, relationship.

In ER diagrams, entities are represented in boxes: |EMPLOYEE| |DEPENDENT| |DEPT| |PROJECT| An attribute (or property) of an entity describes that entity. An ENTITY has a TYPE, including name and list of its attributes. ENTITY TYPE SCHEMA describes the common structure shared by all entities of that type. Project (Name, Num,Location, Dept) ENTITY INSTANCE = individual occurrence of an entity of a particular type at a particular time (Dome, 46, 19 Ave N & Univ, Athletics) (IACC, 52, Bolley & Centennial, C.S.) (Bean Res, 31, 12 Ave N & Federal, P.S.) . . . Entity Type does not change often. Entity instances get added, changed often - very dynamic ATTRIBUTES are written next to Entity they describe.

CATEGORIES OF ATTRIBUTES: COMPOSITE ATTRIBUTE = subdivided into smaller parts with independent meaning, e.g., Name of Employee subdivided into FName, Minit, LName; indicated: Name (FName, Minit, LName). Also, WorksOn may be a composite attribute of Employee, of Project and Hours: WorksOn (Project, Hours). SINGLE-VALUED ATTRIBUTE: one value per entry. MULTIVALUED ATTRIBUTE (repeating group): multiple values per entry, e.g., Locations (as an attribute of Dept, since a Dept can have multiple locations); for a multivalued attribute, use braces: {Locations}. WorksOn may be a multivalued attribute of Employee as well as composite: {WorksOn (Project, Hours)}. DERIVED ATTRIBUTE: the value can be calculated from other attribute values, e.g., Age calculated from BirthDate and CurrentDate. KEY ATTRIBUTE: each value can occur at most once (has uniqueness); used to identify entity instances. We will mark key attribute(s) with *. ATTRIBUTE DOMAIN: the set of values that may be assigned (value set). Thus the preliminary design of entity types for the COMPANY db is:

DEPARTMENT: *Name, *Number, {Locations}, Manager, ManagerStartDate
EMPLOYEE: Name, *SSN, Sex, Address, Salary, BirthDate, Department, Supervisor, {WorksOn (Project, Hours)}
PROJECT: *Name, *Number, Location, ControlDepartment
DEPENDENT: Employee, *DependentName, Sex, BirthDate, Relationship

Relationships have TYPEs (consisting of the names of the entities and the relationship). A relationship type diagram for a relationship between EMPLOYEE and DEPARTMENT called "WorksFor" is diagrammed (the relationship in a roundish box): |EMPLOYEE|-( WorksFor )-|DEPARTMENT|. RELATIONSHIP INSTANCEs for the above relationship might be, e.g.: (John Q. Smith, Athletics), (Fred T. Brown, Comp. Sci.), (Betty R. Hahn, Business), ... RELATIONSHIP DEGREE: the number of participating entities (usually 2). If an entity participates more than once in the same relationship, then ROLE NAMES are needed to distinguish the multiple participations, e.g., Supervisor and Supervisee in the Supervision relationship (called reflexive relationships).

One decision that has to be made is whether an attribute or a relationship is the appropriate way to model something, e.g., "WorksOn". Above we modeled it as an attribute of EMPLOYEE: {WorksOn(Project, Hours)}. The fact that it is multivalued and composite (involving another entity, Project) suggests that it would be better to model it as a relationship (as an attribute it is very complex!), e.g., WORKS_ON(EMPLOYEE, PROJECT).

CONSTRAINTS ON A RELATIONSHIP: a CARDINALITY CONSTRAINT can be 1-to-1, many-to-1, 1-to-many, or many-to-many. 1-1: MANAGES(EMPLOYEE, DEPARTMENT): each manager MANAGES 1 dept; each dept is MANAGED-BY 1 manager.

ER Example: the COMPANY Entity-Relationship Diagram (showing the schema). (Double connecting lines mean "total" participation; a single line means partial participation.) Many-1: WORKS_FOR(EMPLOYEE, DEPARTMENT): each employee WORKS_FOR 1 dept; each dept is WORKED_FOR by many emps. Many-Many: WORKS_ON(EMPLOYEE, PROJECT): each employee WORKS_ON many projects; each project is WORKED_ON by many emps. A PARTICIPATION CONSTRAINT (for an entity in a relationship) can be total, partial, or min-max. Total: every EMPLOYEE WORKS_FOR some DEPARTMENT. Partial: not every EMPLOYEE MANAGES some DEPT. A RELATIONSHIP can have ATTRIBUTES (properties), e.g., Hours for the WORKS_ON relationship, Manager_Start_Date in the MANAGES relationship.

The COMPANY schema has 6 RELATIONSHIPS (cardinality; participation of each entity):

WORKS_FOR (EMPLOYEE, DEPARTMENT): many:1; total, total
MANAGES (EMPLOYEE, DEPARTMENT): 1:1; partial, total
WORKS_ON (EMPLOYEE, PROJECT): many:many; total, total (with attribute Hours)
CONTROLS (DEPT, PROJECT): 1:many; partial, total
SUPERVISION (EMP, EMPLOYEE): 1:many; partial, partial (reflexive, with role names supervisor and supervisee)
DEPENDENTS_OF (EMP, DEPENDENT): 1:many; partial, total

(The full ER diagram also shows DEPARTMENT (*Name, *Number, {Locations}, number_employees), EMPLOYEE (Name(FN, Mi, LN), *SSN, Sex, Address, Salary, BirthDate), PROJECT (*Name, *Number, Location), and DEPENDENT (*DependentName, Sex, BirthDate, Relationship).)

Data Issues and Problems:

CENTRALITY OF DATA

Data are central to every computer program.

If a program has no data, there is no input, no output, no constants, no variables...

It is hard to imagine a program with no data. Thus, virtually all programs are data management programs and therefore, virtually all computing involves data management.

However, not all the data in computer programs is RESIDUALIZED. RESIDUALIZED data is data stored and managed after the termination of the program that generated it (for reuse later).

Database Management Systems (DBMSs) store and manage residualized data.

HUGE VOLUME

(EVERYONE HAS LOTS OF DATA AVAILABLE TO THEM TODAY!) Data are collected much faster than data are processed or managed. NASA's Earth Observation System (EOS), alone, has collected over 15 petabytes of data already (15,000,000,000,000,000 bytes). Most of it will never be used! Most of it will never be seen! Why not? There is so much volume that the usefulness of much of it will never be discovered.

SOLUTION: Reduce the volume and raise the information density through structuring, querying, filtering, mining, summarizing, aggregating... That's the main task of Data and Database workers today! Claude Shannon's information theory principle comes into play here:

More volume means less information.

A simple illustration: which phone book has more useful information? (Both contain the same 4 data granules: Smith, Jones, 234-9816, 231-7237.)

BOOK-1:
Name  Number
Smith 234-9816
Jones 231-7237

BOOK-2:
Name  Number
Smith 234-9816
Smith 231-7237
Jones 234-9816
Jones 231-7237

Data analysis, querying and mining reduce volume and raise the info level. BOOK-2 has no useful information!

Proper Structuring of data may be the second most important task in data and database systems work today! At the highest level is the decision whether a data set should be structured as horizontal or vertical data (or some combination).

RESIDUALIZATION OF DATA is another important task to be addressed in data systems: MUCH WELL-STRUCTURED DATA IS DISCARDED PREMATURELY. DBs are about storing data persistently, for later use. RESIDUALIZING DATA may be the third most important task!

WHAT IS A DATABASE?

There are many definitions in the literature. An integrated shared repository of operational data of interest to an enterprise.

INTEGRATED : it must be the unification of several distinct files. SHARED : same data can be used by more than 1 user (concurrently) REPOSITORY : implies "persistence". OPERATIONAL DATA : data on accounts, parts, patients, students, employees, genes, stock, pixels,... By contrast, nonoperational incl. I/O data, transient data in buffers, queues... ENTERPRISE : bank, warehouse, hospital, school, corp, gov agency, person..

What is a DBMS? A program which organizes and manages access to residualized data. Databases also contain METADATA (data about the data): metadata is non-user data which contains the descriptive information about the data and the database organization (i.e., catalog data).

Why use a DBMS?

COMPACTNESS (saves space; no paper files necessary).
EASE OF USE (less drudgery; more of the organizational and search work is done by the system; the user specifies what, not how).
CENTRALIZED CONTROL (by the DB Administrator (DBA) and by the CEO).
REDUCES REDUNDANCY (1 copy is enough, but concurrent use must be controlled).
NO INCONSISTENCIES (again, since there is only 1 copy necessary).
ENFORCE STANDARDS (corporate, dept, industry, national, international).
INTEGRITY CONSTRAINTS (automatically maintained, e.g., GENDER=male => MAIDEN_NAME=null).
BALANCE REQUIREMENTS (even conflicting requirements? The DataBase Administrator (DBA) can optimize for the whole company).
DATA INDEPENDENCE (apps are immune to storage structure and access strategy changes).

HORIZONTAL DATA

Almost all commercial databases today are HORIZONTAL; that is, they contain horizontally structured data. Horizontal data is data formed into files of horizontal records of a common type.

HORIZONTAL DATA TERMINOLOGY

FIELDS, RECORDS, and FILES can each be stored (physical, on disk) or logical (as viewed by the user), and each has a type (e.g., datatype) and occurrences (instances). TYPE: defines the structure/expected contents (time-independent; changes only upon DB reorganization). OCCURRENCE: the actual data instances at a given time (time-dependent; changes with every insert/delete/update).

STORED FIELD is the smallest unit of stored data in a database. E.g., Jones is a stored field occurrence of Lname; Char 25 is the metadata type of that occurrence.

STORED RECORD is a named horizontal concatenation of related stored fields, e.g., an instance:

| Jones | John | 412 Elm St | Fargo | ND | 58102 |

with field names Lname, Fname, Address, City, St, Zip and field types Lname(char25), Fname(char15), Address(char20), City(char15), St(char2), Zip(char5).

STORED FILE is a named collection of all occurrences of 1 type of stored record, e.g., the Employee file (record and field names, then record instances):

| Lname | Fname | Address  | City  | St | Zip   |
| Jones | John  | 412 Elm  | Fargo | ND | 58102 |
| Smith | James | 415 Oak  | Mhd   | MN | 56560 |
| Thom  | Bob   | 12 Main  | Mhd   | MN | 56560 |
| Trath | Phil  | 234 12St | Fargo | ND | 58105 |
. . . EoF (End of File marker)

Stored, continued

The employee file type IS the common employee record type (plus, possibly, some other type characteristics, e.g., max-#-records). In today's storage device world there is only linear storage space, so the 2-D picture of a stored file is, strictly speaking, not possible in physical storage media today. Some day there may be truly 2-D storage (e.g., holographic storage) and even 3-D. A more accurate depiction of the stored Employee file (as stored on linear storage):

| Jones | John | 412 Elm | Fargo | ND | 58102 || Smith | James | 415 Oak | Mhd | MN | 56560 || Thom | Bob | 12 Main | Mhd | MN | 56560 || Trath | Phil | 234 12St | Fargo | ND | 58105 | EoF |

How these entities are stored and how they are viewed or known to users may differ; there may be many logical variations on one stored file. A logical record based on the 1st occurring employee record above might be:

| Jones | John | Fargo | ND |

So we also have:
LOGICAL FIELD = smallest unit of logical data.
LOGICAL RECORD = named collection of related logical fields.
LOGICAL FILE = named collection of occurrences of 1 type of logical record, which may or may not correspond to the physical entities.
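The stored/logical distinction above can be sketched in a few lines of Python. This is a hypothetical illustration (the fixed-width parsing scheme is assumed, not prescribed by the notes): the stored record is the full physical concatenation of fields, and a logical record is just a projection of it.

```python
# Sketch: a stored Employee record vs. a logical record (a projection of it).
# Field widths follow the example types above: Lname(char25), Fname(char15),
# Address(char20), City(char15), St(char2), Zip(char5).

STORED_FIELDS = [("Lname", 25), ("Fname", 15), ("Address", 20),
                 ("City", 15), ("St", 2), ("Zip", 5)]

def parse_stored_record(raw: str) -> dict:
    """Split a fixed-width stored record into its stored field occurrences."""
    record, pos = {}, 0
    for name, width in STORED_FIELDS:
        record[name] = raw[pos:pos + width].strip()
        pos += width
    return record

def logical_record(stored: dict, fields=("Lname", "Fname", "City", "St")) -> dict:
    """A logical record: a named collection of related logical fields
    (here, a projection of the stored record)."""
    return {f: stored[f] for f in fields}

raw = ("Jones".ljust(25) + "John".ljust(15) + "412 Elm St".ljust(20)
       + "Fargo".ljust(15) + "ND" + "58102")
stored = parse_stored_record(raw)
print(logical_record(stored))
# {'Lname': 'Jones', 'Fname': 'John', 'City': 'Fargo', 'St': 'ND'}
```

The same stored file could back many different logical files, each defined by a different projection.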

DATA MODELS

For conceptualizing (logically) and storing (physically) data in a database we have horizontal and vertical models. Here are some of the HORIZONTAL MODELS for files of horizontal records (in which processing is typically done through vertical scans, e.g., get and process the 1st record, get and process the next record, ...):

RELATIONAL (simple flat unordered files, or relations, of records: tuples of unordered field values)
TABULAR (ordered files of ordered fields)
INVERTED LIST (Tabular with an access path (index) on every field)
HIERARCHICAL (files with hierarchical links)
NETWORK (files with record chains)
OBJECT-RELATIONAL (Relational with "Large OBject" (LOB) fields, i.e., attributes which point to or contain complex objects)
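The scan pattern shared by these horizontal models (get and process the 1st record, then the next, ...) is the classic VPHD loop. A minimal sketch, with a hypothetical selection predicate over the Employee file from earlier:

```python
# Vertical Processing of Horizontal Data (VPHD): scan DOWN the file,
# one horizontal record at a time. The predicate here is hypothetical.
employee_file = [
    ("Jones", "John",  "412 Elm",  "Fargo", "ND", "58102"),
    ("Smith", "James", "415 Oak",  "Mhd",   "MN", "56560"),
    ("Thom",  "Bob",   "12 Main",  "Mhd",   "MN", "56560"),
    ("Trath", "Phil",  "234 12St", "Fargo", "ND", "58105"),
]

def vphd_scan(records, predicate):
    """Get and process the 1st record, then the next, ... until EoF."""
    hits = []
    for record in records:       # the vertical scan down the file
        if predicate(record):    # horizontal processing of one record
            hits.append(record)
    return hits

nd_employees = vphd_scan(employee_file, lambda r: r[4] == "ND")
print([r[0] for r in nd_employees])   # ['Jones', 'Trath']
```

The cost of this loop grows with the depth of the file, which is exactly the problem when the data gets BIG.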

RELATIONAL DATA MODEL

The only construct allowed is a [simple, flat] relation, used for both entity description and relationship definition.

STUDENT                      COURSE
S#  NAME   LOC               C# CNAME SITE
|17 |BAID  |NY2091|          |8 |DSDE |ND |
|57 |BROWN |NY2092|          |6 |3UA  |NJ |
                             |5 |3UA  |ND |

ENROLL                       LOCATION
S#  C#  GRADE                LCODE   STATUS
|32 |8 | 89 |                |NJ5101| 1 |
|32 |7 | 91 |                |NJ5102| 1 |
|25 |7 | 68 |                |FL6321| 4 |
|25 |6 | 76 |                |NY2091| 3 |
|32 |6 | 62 |                |NY2092| 3 |
|38 |6 | 98 |
|17 |5 | 96 |

The STUDENT and COURSE relations represent entities. The LOCATION relation represents a relationship between the LCODE and STATUS attributes (1-to-many). The ENROLL relation represents a relationship between the Student and Course entities (a many-to-many relationship).

Here are some of the VERTICAL MODELS:

BINARY STORAGE MODEL
VIPER STRUCTURES
PREDICATE-Trees or Ptrees (~1998) (used vertical bit vectors for data mining)

(For vertical vectors or trees of attribute values, processing is typically done through logical horizontal operations across the vectors.)

HIERARCHICAL DATA MODEL

entities = records
relationships = links of records forming trees

EX: the root type is STUDENT (with attributes S#, NAME, LOCATION), the dependent type is COURSE (with attributes C#, CNAME), and the 2nd-level dependent type is ENROLLMENT (with attributes GRADE, LOC):

STUDENTS:     25|CLAY|OTBK       32|THAISZ|KNB           38|GOOD|GTR
COURSES:      7|CUS   6|3UA      8|DSDE  7|CUS  6|3UA    6|3UA
ENROLLMENTS:  ND|68   NJ|76      ND|89   ND|91  ND|62    NJ|98

If the typical workload involves producing class lists for students, this organization is very good. Why? If the typical workload is producing course enrollment lists for professors, this is very poor. Why?

The problem with the Hierarchical Data Model is that it almost always favors a particular workload category (at the expense of the others).
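The asymmetry can be seen directly in code. A minimal sketch using nested dicts (a hypothetical rendering of the STUDENT/COURSE/ENROLLMENT example above): with enrollments clustered under each student, a class list is one walk of that student's subtree, while a per-course roster forces a scan of every student's subtree.

```python
# Hierarchical organization: COURSE/ENROLLMENT data clustered under STUDENT.
# Values follow the STUDENT/COURSE/ENROLLMENT example above (C#: GRADE).
db = {
    25: {"name": "CLAY",   "courses": {7: 68, 6: 76}},
    32: {"name": "THAISZ", "courses": {8: 89, 7: 91, 6: 62}},
    38: {"name": "GOOD",   "courses": {6: 98}},
}

def class_list(s_num):
    """Very good: one direct walk down the student's own subtree."""
    return sorted(db[s_num]["courses"])

def course_roster(c_num):
    """Very poor: must scan EVERY student subtree to find one course."""
    return sorted(s for s, rec in db.items() if c_num in rec["courses"])

print(class_list(32))      # [6, 7, 8]
print(course_roster(6))    # [25, 32, 38]
```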

NETWORK DATA MODEL

entities = records
relationships = owner-member chains (sets); many-to-many relationships are easily accommodated.

EX: 3 entities (STUDENT, ENROLLMENT, COURSE) and 2 owner-member chains (STUDENT-ENROLLMENT and COURSE-ENROLLMENT):

STUDENT records:     25|CLAY|MJ511     32|THAISZ|NJ512
ENROLLMENT records:  68  76  89  91  62
COURSE records:      8|DSDE|ND     7|CUS|ND     6|3UA|NJ
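The owner-member chains can be sketched as linked lists: each ENROLLMENT record sits on two chains at once, one owned by its STUDENT and one owned by its COURSE, so each grade is stored exactly once. The class and field names below are a hypothetical rendering of the example, not part of the notes.

```python
# Network-model sketch: ENROLLMENT members threaded onto two owner chains.
class Enrollment:
    def __init__(self, grade):
        self.grade = grade
        self.next_for_student = None   # next member on the student's chain
        self.next_for_course = None    # next member on the course's chain

def chain(head, link):
    """Walk an owner-member chain by following one of the two link fields."""
    node = head
    while node is not None:
        yield node
        node = getattr(node, link)

# Student 25's two enrollments (grades 68 and 76, from the example above).
e1, e2 = Enrollment(68), Enrollment(76)
e1.next_for_student = e2               # STUDENT-ENROLLMENT chain: e1 -> e2
student_25 = e1                        # the owner points at its 1st member
course_7, course_6 = e1, e2            # each COURSE chain owns one member

print([n.grade for n in chain(student_25, "next_for_student")])  # [68, 76]
print([n.grade for n in chain(course_7, "next_for_course")])     # [68]
```

Because both chains thread through the same Enrollment object, updating a grade updates it for every chain that reaches it: zero redundancy.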

It is easy to insert (create a new record and reset pointers), delete (reset pointers), and update (there is always just 1 copy to worry about: ZERO REDUNDANCY!). The network approach gives fast processing but a complicated structure (it usually requires a data-processing shop). Again, it favors one workload type over others.

INVERTED LIST MODEL (TABULAR): Flat Ordered Files (like relational, except there is an intrinsic order, visible to user programs, on both tuples and attributes). The order is usually "arrival order", meaning each record is given a unique "Relative Record Number" or RRN when it is inserted. RRNs never change (unless there is a reorganization). Programs can access records by RRN. Physical placement of records on pages is in RRN order ("clustered on RRN") so that application programs can efficiently retrieve in RRN order. Indexes, etc., can be provided for other access paths (and orderings).

page1
RRN  S#  ST
| 0 | 25 |NJ|
| 1 | 32 |NJ|
| 2 | 38 |FL|
| 3 | 47 |NY|
page2
| 0 | 57 |NY|

STATE-INDEX
RID   STATE
|1,2| FL |
|1,0| NJ |
|1,1| NJ |
|1,3| NY |
|2,0| NY |

OBJECT-RELATIONAL MODEL (OR model) is like the relational model except that repeating groups are allowed (many levels of repeating groups, even nested repeating groups) and pointers to very complex structures are allowed (LOBs for Large OBjects, BLOBs for Binary Large OBjects, etc., for storing, e.g., pictures, movies, and other binary large objects).
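A minimal sketch of the inverted-list access paths: records sit in pages in RRN (arrival) order, programs address them directly by RID = (page, RRN), and a per-field index supplies the alternate access path. The page layout and STATE-INDEX mirror the example above; the function names are hypothetical.

```python
# Inverted-list (tabular) sketch: records clustered in RRN (arrival) order,
# addressed by RID = (page, RRN), with an index on the ST field.
pages = {
    1: [(25, "NJ"), (32, "NJ"), (38, "FL"), (47, "NY")],   # page 1, RRNs 0..3
    2: [(57, "NY")],                                        # page 2, RRN 0
}

def fetch(rid):
    """Direct access by Record ID: (page number, relative record number)."""
    page, rrn = rid
    return pages[page][rrn]

# STATE-INDEX: an access path on the ST field, mapping value -> list of RIDs.
state_index = {"FL": [(1, 2)], "NJ": [(1, 0), (1, 1)], "NY": [(1, 3), (2, 0)]}

print(fetch((1, 1)))                                # (32, 'NJ')
print([fetch(rid) for rid in state_index["NY"]])    # [(47, 'NY'), (57, 'NY')]
```

Sequential retrieval in RRN order is efficient because it matches physical placement; the index pays for out-of-order access paths.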