Sections 13.1 – 13.3

Download Report

Transcript Sections 13.1 – 13.3

Cs257 Summary

By Swathi Vegesna 217

SECONDARY STORAGE MANAGEMENT

Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin

Sections 13.1 – 13.3

13.1.1 Memory Hierarchy

• • • Data storage capacities varies for different data Cost per byte to store data also varies Device with smallest capacity offer the fastest speed with highest cost per bit

Memory Hierarchy Diagram

Programs, DBMS Main Memory DBMS’s Tertiary Storage As Visual Memory Disk File System Main Memory Cache

13.1.1 Memory Hierarchy

• Cache – Lowest level of the hierarchy – Data items are copies of certain locations of main memory – Sometimes, values in cache are changed and corresponding changes to main memory are delayed – Machine looks for instructions as well as data for those instructions in the cache – – Holds limited amount of data No need to update the data in main memory immediately in a single processor computer – In multiple processors data is updated immediately to main memory….called as write through

Main Memory

• • • Refers to physical memory that is internal to the computer. The word main is used to distinguish it from external mass storage devices such as disk drives.

Everything happens in the computer i.e. instruction execution, data manipulation, as working on information that is resident in main memory Main memories are random access….one can obtain any byte in the same amount of time

Secondary storage

• • • • Used to store data and programs when they are not being processed More permanent than main memory, as data and programs are retained when the power is turned off A personal computer might only require 20,000 bytes of secondary storage E.g. magnetic disks, hard disks

Tertiary Storage

• • • • consists of anywhere from one to several storage drives.

It is a comprehensive computer storage system that is usually very slow, so it is usually used to archive data that is not accessed frequently. Holds data volumes in terabytes Used for databases much larger than what can be stored on disk

13.1.2 Transfer of Data Between levels

• • • • • • Data moves between adjacent levels of the hierarchy At the secondary or tertiary levels accessing the desired data or finding the desired place to store the data takes a lot of time Disk is organized into bocks Entire blocks are moved to and from memory called a buffer A key technique for speeding up database operations is to arrange the data so that when one piece of data block is needed it is likely that other data on the same block will be needed at the same time Same idea applies to other hierarchy levels

13.1.3 Volatile and Non Volatile Storage

• • • • A volatile device forgets what data is stored on it after power off Non volatile holds data for longer period even when device is turned off Secondary and tertiary devices are non volatile Main memory is volatile

13.1.4 Virtual Memory

• • • • • computer system technique which gives an application program the impression that it has contiguous working memory (an address space), while in fact it may be physically fragmented and may even overflow on to disk storage technique make programming of large applications easier and use real physical memory (e.g. RAM) more efficiently Typical software executes in virtual memory Address space is typically 32 bit or 2 32 bytes or 4GB Transfer between memory and disk is in terms of blocks

13.2.1 Mechanism of Disk

• Mechanisms of Disks – Use of secondary storage is one of the important characteristic of DBMS – Consists of 2 moving pieces of a disk • 1. disk assembly • 2. head assembly – Disk assembly consists of 1 or more platters – – Platters rotate around a central spindle Bits are stored on upper and lower surfaces of platters

13.2.1 Mechanism of Disk

• • • • Disk is organized into tracks The track that are at fixed radius from center form one cylinder Tracks are organized into sectors Tracks are the segments of circle separated by gap

13.2.2 Disk Controller

• • One or more disks are controlled by disk controllers Disks controllers are capable of – Controlling the mechanical actuator that moves the head assembly – Selecting the sector from among all those in the cylinder at which heads are positioned – Transferring bits between desired sector and main memory – Possible buffering an entire track

13.2.3 Disk Access Characteristics

• Accessing (reading/writing) a block requires 3 steps – Disk controller positions the head assembly at the cylinder containing the track on which the block is located. It is a ‘seek time’ – The disk controller waits while the first sector of the block moves under the head. This is a ‘rotational latency’ – All the sectors and the gaps between them pass the head, while disk controller reads or writes data in these sectors. This is a ‘transfer time’

13.3 Accelerating Access to Secondary Storage

 Secondary storage definition  Several approaches for more-efficiently accessing data in secondary storage:  Place blocks that are together in the same cylinder.  Divide the data among multiple disks.  Mirror disks.  Use disk-scheduling algorithms.  Prefetch blocks into main memory.  Scheduling Latency – added delay in accessing data caused by a disk scheduling algorithm.  Throughput – the number of disk accesses per second that the system can accommodate.

13.3.1 The I/O Model of Computation

 The number of block accesses (Disk I/O’s) is a good time approximation for the algorithm.  Disk I/o’s proportional to time taken, so should be minimized.  Ex 13.3: You want to have an index on R to identify the block on which the desired tuple appears, but not where on the block it resides.  For Megatron 747 (M747) example, it takes 11ms to read a 16k block.  delay in searching for the desired tuple is negligible.

13.3.2 Organizing Data by Cylinders

 first seek time and first rotational latency can never be neglected .

 Ex 13.4: We request 1024 blocks of M747.  If data is randomly distributed, average latency is 10.76ms by Ex 13.2, making total latency 11s.  If all blocks are consecutively stored on 1 cylinder:  6.46ms + 8.33ms * 16 = 139ms (1 average seek) (time per rotation) (# rotations)

13.3.3 Using Multiple Disks

Number of disks is proportional to the factor by which performance is

performance will increase by improved  Striping – distributing a relation across multiple disks following this pattern:  Data on disk R 1 : R 1 , R 1+n , R 1+2n ,…  Data on disk R 2 : R 2 , R 2+n , R 2+2n ,… …  Data on disk R n : R n , R n+n , R n+2n , …  Ex 13.5: We request 1024 blocks with n = 4.  6.46ms + (8.33ms * (16/4)) = 39.8ms

(1 average seek) (time per rotation) (# rotations)

13.3.4 Mirroring Disks

 Mirroring Disks – having 2 or more disks hold identical copy of data.  Benefit 1: If n disks are mirrors of each other, the system can survive a crash by n-1 disks.  Benefit 2: If we have n disks, read performance increases by a factor of n.  Performance increases =>increasing efficiency

13.3.5 Disk Scheduling and the Elevator Problem

 Disk controller will run this algorithm to select which of several requests to process first.  Pseudo code:  requests[] // array of all non-processed data requests  upon receiving new data request:  requests[].add(new request)  while(requests[] is not empty)  move head to next location  if(head is at data in requests[])  retrieves data  removes data from requests[]  if(head reaches end)  reverses head direction

13.3.5 Disk Scheduling and the Elevator Problem (con’t)

Events:

Head starting point Request data at 8000 Request data at 24000 Request data at 56000 Get data at 8000 Request data at 16000 Get data at 24000 Request data at 64000 Get data at 56000 Request Data at 40000 Get data at 64000 Get data at 40000 Get data at 16000 64000 56000 48000 40000 32000 24000 16000 8000

data

8000 ..

24000 ..

56000 ..

64000 ..

40000 ..

16000 ..

time

13.6

34.2

56.8

13.3.5 Disk Scheduling and the Elevator Problem (con’t)

Elevator Algorithm data

8000 ..

24000 ..

56000 ..

64000 ..

40000 ..

16000 ..

time

4.3

13.6

26.9

34.2

45.5

56.8

FIFO Algorithm data time

8000 ..

24000 ..

56000 ..

16000 ..

64000 ..

40000 ..

4.3

13.6

26.9

42.2

59.5

70.8

13.3.6 Prefetching and Large-Scale Buffering

 If at the application level, we can predict the order blocks will be requested, we can load them into main memory before they are needed.  This even reduces the cost and even save the time

13.4.Disk Failures

Intermittent Error: Read or write is unsuccessful.

If we try to read the sector but the correct content of that sector is not delivered to the disk controller. Check for the good or bad sector. To check write is correct: Read is performed.

Good sector and bad sector is known by the read operation.

Checksums: Each sector has some additional bits, called the checksums. They are set on the depending on the values of the data bits stored in that sector. Probability of reading bad sector is less if we use checksums. For Odd parity: Odd number of 1’s, add a parity bit 1. For Even parity: Even number of 1’s, add a parity bit 0. So, number of 1’s becomes always even.

Example: 1. Sequence : 01101000-> odd no of 1’s parity bit: 1 -> 011010001 2. Sequence : 111011100->even no of 1’s parity bit: 0 -> 111011100

• Stable -Storage Writing Policy: To recover the disk failure known as Media Decay, in which if we overwrite a file, the new data is not read correctly. Sectors are paired and each pair is said to be X, having left and right copies as Xl and Xr respectively and check the parity bit of left and right by substituting spare sector of Xl and Xr until the good value is returned.

• • • The term used for these strategies is RAID or Redundant Arrays of Independent Disks.

Mirroring: Mirroring Scheme is referred as RAID level 1 protection against data loss scheme. In this scheme we mirror each disk. One of the disk is called as data disk and other redundant

disk

.

In this case the only way data can be lost is if there is a second disk crash while the first crash is being repaired.

Parity Blocks: RAID level 4 scheme uses only one redundant disk no matter how many data disks there are. In the redundant disk, the ith block consists of the parity checks for the ith blocks of all the data disks. It means, the jth bits of all the ith blocks of both data disks and redundant disks, must have an even number of 1’s and redundant disk bit is used to make this condition true.

• • • • Failures: If out of Xl and Xr, one fails, it can be read form other, but in case both fails X is not readable, and its probability is very small Write Failure: During power outage, 1. While writing Xl, the Xr, will remain good and X can be read from Xr 2. After writing Xl, we can read X from Xl, as Xr may or may not have the correct copy of X.

• Recovery from Disk Crashes: To reduce the data loss by Dish crashes, schemes which involve redundancy, extending the idea of parity checks or duplicate sectors can be applied.

• Parity Block – Writing When we write a new block of a data disk, we need to change that block of the redundant disk as well.

• One approach to do this is to read all the disks and compute the module-2 sum and write to the redundant disk.

But this approach requires n-1 reads of data, write a data block and write of redundant disk block. Total = n+1 disk I/Os

• RAID 5 RAID 4 is effective in preserving data unless there are two simultaneous disk crashes.

Whatever scheme we use for updating the disks, we need to read and write the redundant disk's block. If there are n data disks, then the number of disk writes to the redundant disk will be n times the average number of writes to any one data disk.

However we do not have to treat one disk as the redundant disk and the others as data disks. Rather, we could treat each disk as the redundant disk for some of the blocks. This improvement is often called RAID level 5.

13.5

Arranging data on disk

• Data elements are represented as records, which stores in consecutive bytes in same same disk block. Basic layout techniques of storing data : Fixed-Length Records Allocation criteria - data should start at word boundary.

Fixed Length record header 1. A pointer to record schema.

2. The length of the record.

3. Timestamps to indicate last modified or last read.

Example CREATE TABLE employee( name CHAR(30) PRIMARY KEY, address VARCHAR(255), gender CHAR(1), birthdate DATE ); Data should start at word boundary and contain header and four fields name, address, gender and birthdate.

• Packing Fixed-Length Records into Blocks Records are stored in the form of blocks on the disk and they move into main memory when we need to update or access them.

A block header is written first, and it is followed by series of blocks.

Block header contains the following information: Links to one or more blocks that are part of a network of blocks.

Information about the role played by this block in such a network.

Information about the relation, the tuples in this block belong to.

A "directory" giving the offset of each record in the block.

Time stamp(s) to indicate time of the block's last modification and/or access Along with the header we can pack as many record as we can Along with the header we can pack as many record as we can in one block as shown in the figure and remaining space will be unused.

13.6 Representing Block and Record Addresses • • Address of a block and Record – In Main Memory • Address of the block is the virtual memory address of the first byte • Address of the record within the block is the virtual memory address of the first byte of the record – In Secondary Memory: sequence of bytes describe the location of the block in the overall system Sequence of Bytes describe the location of the block : the device Id for the disk, Cylinder number, etc.

• • • Addresses in Client-Server Systems The addresses in address space are represented in two ways – Physical Addresses: byte strings that determine the place within the secondary storage system where the record can be found.

– Logical Addresses: arbitrary string of bytes of some fixed length Physical Address bits are used to indicate: – Host to which the storage is attached – Identifier for the disk – Number of the cylinder – Number of the track – Offset of the beginning of the record

• Map Table relates logical addresses to physical addresses .

Logical A ddress

Logical Physical

Physical Address

• • Logical and Structured Addresses Purpose of logical address?

Gives more flexibility, when we – Move the record around within the block – Move the record to another block Gives us an option of deciding what to do when a record is deleted?

Pointer Swizzling Having pointers is common in an object-relational database systems Important to learn about the management of pointers

Every data item (block, record, etc.) has two addresses:

database address: address on the disk

memory address, if the item is in virtual memory

• Translation Table: Maps database address to memory address

Dbaddr Mem-addr

Database address • Memory Address All addressable items in the database have entries in the map table, while only those items currently in memory are mentioned in the translation table

• Pointer consists of the following two fields – – – Bit indicating the type of address Database or memory address Example 13.17

Disk Block 1 Memory Swizzled Block 1 Unswizzled Block 2

• • Example 13.7

Block 1 has a record with pointers to a second record on the same block and to a record on another block If Block 1 is copied to the memory – The first pointer which points within Block 1 can be swizzled so it points directly to the memory address of the target record – Since Block 2 is not in memory, we cannot swizzle the second pointer Three types of swizzling – Automatic Swizzling • As soon as block is brought into memory, swizzle all relevant pointers.

• – Swizzling on Demand • Only swizzle a pointer if and when it is actually followed.

– No Swizzling • Pointers are not swizzled they are accesses using the database address.

Unswizzling – When a block is moved from memory back to disk, all pointers must go back to database (disk) addresses – Use translation table again – Important to have an efficient data structure for the translation table

• • • Pinned records and Blocks A block in memory is said to be pinned if it cannot be written back to disk safely.

If block B1 has swizzled pointer to an item in block B2, then B2 is pinned – Unpin a block, we must unswizzle any pointers to it – Keep in the translation table the places in memory holding swizzled pointers to that item – Unswizzle those pointers (use translation table to replace the memory addresses with database (disk) addresses

13.7 Records With Variable-Length Fields A simple but effective scheme is to put all fixed length fields ahead of the variable-length fields. We then place in the record header: 1. The length of the record.

2. Pointers to (i.e., offsets of) the beginnings of all the variable length fields. However, if the variable-length fields always appear in the same order then the first of them needs no pointer; we know it immediately follows the fixed-length fields.

• • Records With Repeating Fields A similar situation occurs if a record contains a variable number of Occurrences of a field F, but the field itself is of fixed length. It is sufficient to group all occurrences of field F • together and put in the record header a pointer to the first.

We can locate all the occurrences of the field F as follows. Let the number of bytes devoted to one instance of field F be L.

We then add to the offset for the field F all integer • multiples of L, starting at 0, then L, 2L, 3L, and so on. Eventually, we reach the offset of the field following F.

Where upon we stop.

• • An alternative representation is to keep the record of fixed length, and put the variable length portion - be it fields of variable length or fields that repeat an indefinite number of times - on a separate block. In the record itself we keep: 1. Pointers to the place where each repeating field begins, and 2.

Either how many repetitions there are, or where the repetitions end.

• • Variable-Format Records The simplest representation of variable-format records is a sequence of tagged fields, each of which consists of:

1. Information about the role of this field, such as:

(a) The attribute or field name, (b) The type of the field, if it is not apparent from the field name and some readily available schema information, and (c) The length of the field, if it is not apparent from the type.

2. The value of the field.

There are at least two reasons why tagged fields would make sense.

1. Information integration applications - Sometimes, a relation has been constructed from several earlier sources, and these sources have different kinds of information For instance, our movie star information may have come from several sources, one of which records birthdates, some give addresses, others not, and so on. If there are not too many fields, we are probably best off leaving NULL those values we do not know.

2. Records with a very flexible schema - If many fields of a record can repeat and/or not appear at all, then even if we know the schema, tagged fields may be useful. For instance, medical records may contain information about many tests, but there are thousands of possible tests, and each patient has results for relatively few of them

• These large values have a variable length, but even if the length is fixed for all values of the type, we need to use some special techniques to represent these values. In this section we shall consider a technique called “spanned records" that can be used to manage records that are larger than blocks.

• Spanned records also are useful in situations where records are smaller than blocks, but packing whole records into blocks wastes significant amounts of space.

For both these reasons, it is sometimes desirable to allow records to be split across two or more blocks. The portion of a record that appears in one block is called a record fragment. If records can be spanned, then every record and record fragment requires some extra header information:

1. Each record or fragment header must contain a bit telling whether or not it is a fragment.

2. If it is a fragment, then it needs bits telling whether it is the first or last fragment for its record.

3.

If there is a next and/or previous fragment for the same record, then the fragment needs pointers to these other fragments.

Storing spanned records across blocks:

• • • • •

BLOBS

Binary, Large OBjectS = BLOBS BLOBS can be images, movies, audio files and other very large values that can be stored in files.

Storing BLOBS – Stored in several blocks.

Preferable to store them consecutively on a cylinder or multiple disks for efficient retrieval.

Retrieving BLOBS – A client retrieving a 2 hour movie may not want it all at the same time.

– Retrieving a specific part of the large data requires an index structure to make it efficient. (Example: An index by seconds on a movie BLOB.)

• • Column Stores: An alternative to storing tuples as records is to store each column as a record. Since an entire column of a relation may occupy far more than a single block, these records may span many block, much as long as files do. If we keep the values in each column in the same order then we can reconstruct the relation from column records

• 

13.8

Insertion: Insertion of records without order Records can be placed in a block with empty space or in a new block.

Insertion of records in fixed order – Space available in the block – No space available in the block (outside the block) Structured address • Pointer to a record from outside the block.

Insertion in fixed order Space available within the block

Use of an offset table in the header of each block with pointers to

the location of each record in the block.

• The records are slid within the block and the pointers in the offset table are adjusted.

No space available within the block (outside the block)  Find space on a “nearby” block.

• • In case of no space available on a block, look at the following block in sorted order of blocks.

If space is available in that block ,move the highest records of first block 1 to block 2 and slide the records around on both blocks.

 • • Create an overflow block Records can be stored in overflow block.

Each block has place for a pointer to an overflow block in its header. The overflow block can point to a second overflow block as shown below.

  •  Deletion: Recover space after deletion When using an offset table, the records can be slid around the block so there will be an unused region in the center that can be recovered.

In case we cannot slide records, an available space list can be maintained in the block header.

The list head goes in the block header and available regions hold the links in the list.

Use of tombstone The tombstone is placed in a record in order to avoid pointers to the deleted record to point to new records.

The tombstone is permanent until the entire database is reconstructed.

• • • If pointers go to fixed locations from which the location of the record is found then we put the tombstone in that fixed location. (See examples) Where a tombstone is placed depends on the nature of the record pointers.

Map table is used to translate logical record address to physical address.

UPDATING RECORDS For Fixed-Length Records, there is no effect on the storage system For variable length records : • If length increases, like insertion “slide the records” • If length decreases, like deletion we update the space available list, recover the space/eliminate the overflow blocks.

Chapter:18

• • • • • 18.1 Serial and Serializable Schedule A process of assuming that the transactions preserve the consistency when executing simultaneously is called Concurrency Control.

This consistency is taken care by Scheduler .

Concurrency control in database management systems (DBMS) ensures that database transactions are performed concurrently without the concurrency violating the data integrity of a database.

Executed transactions should follow the ACID rules. The DBMS must guarantee that only serializable (unless Serializability is intentionally relaxed), recoverable schedules are generated.

• • • It also guarantees that no effect of committed transactions is lost, and no effect of aborted (rolled back) transactions remains in the related database.

ACID rules

Atomicity - Either the effects of all or none of its operations remain when a transaction is completed - in other words, to the outside world the transaction appears to be indivisible, atomic. • Consistency - Every transaction consistent state .

must leave the database in a • • Isolation - Transactions cannot interfere with each other. Providing isolation is the main goal of concurrency control. • Durability - Successful transactions must persist through crashes .

  In the field of databases, a schedule is a list of actions, (i.e. reading, writing, aborting, committing), from a set of transactions.

In this example, Schedule D is the set of 3 transactions T1, T2, T3. The schedule describes the actions of the transactions as seen by the DBMS. T1 Reads and writes to object X, and then T2 Reads and writes to object Y, and finally T3 Reads and writes to object Z. This is an example of a serial schedule, because the actions of the 3 transactions are not interleaved.

• • • Serial and Serializable Schedules: A schedule that is equivalent to a serial schedule has the serializability property.

In schedule E, the order in which the actions of the transactions are executed is not the same as in D, but in the end, E gives the same result as D.

Serial Schedule TI precedes T2

T1 T2 A 50 READ (A,t) t := t+100 WRITE (A,t) 150 READ (A,s) s := s*2 WRITE (A,s) 300 READ (B,t) t := t+100 WRITE (B,t) READ (B,s) s := s*2 WRITE (B,s) B 50 150 300

• Non-Serializable Schedule T1 T2 READ (A,t) t := t+100 WRITE (A,t) READ (A,s) s := s*2 WRITE (A,s) READ (B,s) s := s*2 WRITE (B,s) READ (B,t) t := t+100 WRITE (B,t) A 50 150 300 B 50 100 200

• A Serializable Schedule with details T1 T2 A 50 READ (A,t) t := t+100 WRITE (A,t) 150 READ (A,s) s := s*1 WRITE (A,s) READ (B,s) s := s*1 WRITE (B,s) 150 READ (B,t) t := t+100 WRITE (B,t) B 50 50 150

• • • • •

18.2 Conflict Serializability

Non-Conflicting Actions Two actions are non-conflicting if whenever they occur consecutively in a schedule, swapping them does not affect the final state produced by the schedule. Otherwise, they are conflicting .

Conflicting Actions: General Rules

Two actions of the same transaction conflict:

r1(A) w1(B)

Two actions over the same database element conflict, if one of them is a write – r1(A) w2(A) – w1(A) w2(A)

• Conflict Serializable: We may take any schedule and make as many nonconflicting swaps as we wish.

With the goal of turning the schedule into a serial schedule.

• • •

If we can do so, then the original schedule is serializable, because its effect on the database state remains the same as we perform each of the nonconflicting swaps.

A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more serial schedules.

Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if there exists an acyclic precedence graph/serializability graph for the schedule.

Which is conflict-equivalent to the serial schedule , but not .

• • Conflict equivalent / conflict-serializable Let Ai and Aj are consecutive non-conflicting actions that belongs to different transactions. We can swap Ai and Aj without changing the result.

Two schedules are conflict equivalent if they can be turned one into the other by a sequence of non-conflicting swaps of adjacent actions.

• We shall call a schedule conflict-serializable equivalent to a serial schedule if it is conflict Test for conflict-serializability Construct the precedence graph for S and observe if there are any cycles.

– If yes, then S is not conflict-serializable – Else, it is a conflict-serializable schedule.

• • Example of a cyclic precedence graph: – Consider the below schedule S 1 : r 2 (A); r 1 (B); w 2 (A); r 2 (B); r 3 (A); w 1 (B); w 3 (A); w 2 (B); Observing the actions of A in the previous example (figure 2), we can find that T 2

• • • But when we observe B, we get both T can conclude that S 1 1

• If there is a cycle involving n transactions T – 1 T 2 ..T

n

Then in the hypothetical serial order, the actions of T 1

– –

precede those of T 2

But actions of T n

which would precede those of T 3

are also required to precede those of T 1 T .

1

must ... up to n.

So, if there is a cycle in the graph, then we can conclude that the schedule is not conflict-serializable.

18.3 Enforcing Serializability by Locks:

• • Locks It works as follows : – – – A request from transaction Scheduler checks in the lock table Generates a serializable schedule of actions.

• • Consistency of transactions Actions and locks must relate each other – Transactions can only read & write only if has a lock and has not released the lock.

– Unlocking an element is compulsory.

• Legality of schedules – No two transactions can aquire the lock on same element without the prior one releasing it.

• Locking scheduler Grants lock requests only if it is in a legal schedule.

Lock table stores the information about current locks on the elements.

• A legal schedule of consistent transactions but unfortunately it is not a serializable.

• The locking scheduler delays requests that would result in an illegal schedule.

• • • • • Two-phase locking Guarantees a legal schedule of consistent transactions is conflict-serializable.

All lock requests proceed all unlock requests.

The growing phase: – Obtain all the locks and no unlocks allowed.

The shrinking phase: – Release all the locks and no locks allowed.

• Failure of 2PL

2PL fails to provide security against deadlocks.

• • • Shared & Exclusive Locks: Consistency of Transactions – Cannot write without Exclusive Lock – Cannot read without holding some lock This basically works on these principles – 1. Consistency of Transactions – A read action can only proceed a shared or an exclusive lock – A write lock can only proceed a exclusive lock –

All locks need to be unlocked before commit

2. Two-phase locking of transactions – Locking Must precede unlocking

3. Legality of Schedules – An element may be locked exclusively by one transaction or by several in shared mode, but not both

• • • • • Compatibility Matrices: Has a row and column for each lock mode.

– Rows correspond to a lock held on an element by another transaction – Columns correspond to mode of lock requested.

– Example :The column for S says that we can grant a shared lock on an element if the only locks held on that element currently are shared locks.

Upgrading Locks

Suppose a transaction wants to read as well as write :

It acquires a shared lock on the element

Performs the calculations on the element

And when its ready to write, It is granted a exclusive lock .

Transactions with unpredicted read write locks can use upgrading locks.

• • Indiscriminating use of upgrading produces a deadlock. (Limitation) Example : Both the transactions want to upgrade on the same element

18.4 Locking Systems with Several Lock Modes

Locking Scheme – Shared/Read Lock ( For Reading) – Exclusive/Write Lock( For Writing) • • • • Compatibility Matrices Upgrading Locks Update Locks Increment Locks 79

Shared & Exclusive Locks

• • • Consistency of Transactions – Cannot write without Exclusive Lock – Cannot read without holding some lock This basically works on 2 principles – A read action can only proceed a shared or an exclusive lock – A write lock can only proceed a exclusice lock

All locks need to be unlocked before commit

80

Shared and exclusive locks (cont.)

• • Two-phase locking of transactions – Must precede unlocking Legality of Schedules – An element may be locked exclusively by one transaction or by several in shared mode, but not both.

81

Compatibility Matrices

• Has a row and column for each lock mode.

– Rows correspond to a lock held on an element by another transaction – Columns correspond to mode of lock requested.

– Example : LOCK HOLD S X S

LOCK REQUESTED

X YES NO NO NO 82

Update locks

• • • • Solves the deadlock occurring in upgrade lock method.

A transaction in an update lock can read but cant write.

Update lock can later be converted to exclusive lock.

An update lock can only be given if the element has shared locks.

83

Increment Locks

• • Used for incrementing & decrementing stored values.

E.g. - Transfer money from one bank to another, Ticket selling transactions in which number seats are decremented after each transaction.

84

• • • • Increment lock

A increment lock does not enable either read or write locks on element.

Any number of transaction can hold increment lock on an element at any time.

Shared and exclusive locks cannot be granted if an increment lock is granted on element 85

18.5 Locking Scheduler • The order in which the individual steps of different transactions occur is regulated by the scheduler.

The general process of assuring that transactions preserve consistency when executing simultaneously is called concurrency control.

Architecture of a Locking Scheduler The transactions themselves do not request locks, or cannot be relied upon to do so. It is the job of the scheduler to insert lock actions into the stream of reads, writes and other actions that access data.

Transactions do not locks. Rather the scheduler releases the locks when the transaction manager tells it that the transaction will commit or abort.

Role of a Scheduler

Lock Table

• • • • Lock Table The lock table is a relation that associates database elements with locking information about that element.

The table is implemented with a hash table using database elements as a hash key.

Size of Lock Table

The size of the table is proportional to the number of locked elements only and not to the entire size of the database since any element that is not locked does not appear in the table.

Group Mode The group mode is a summary of the most stringent conditions that a transaction requesting a new lock on an element faces. Rather than comparing the lock request with every lock held by another transaction on the same element, we can simplify the grant/deny decision by comparing the request with only the group mode.

Structure of Lock Table Entries

• • • • • • • • Handling Lock Requests Suppose transaction T requests a lock on A. If there is no lock-table entry for A, then surely there are no locks on A, so the entry is created and the request is granted.

If the lock-table entry for A exists then we use it to guide the decision about the lock request.

Handling Unlocks If the value of waiting is ‘Yes’ then we need to grant one or more locks from the list of requested locks. The different approaches for this are: First-come-first-served Priority to shared locks Priority to upgrading

18.6 Managing Hierarchies of Database Elements It Focus on two problems that come up when there id tree structure to our data.

1. Tree Structure : Hierarchy of lockable elements. And How to allow locks on both large elements, like Relations and elements in it such as blocks and tuples of relation, or individual.

2. Another is data that is itself organized in a tree. A major example would be B-tree index. Locks With Multiple Granularity “Database Elements” : It is sometime noticeably the various elements which can be used for locking.

Eg: Tuples, Pages or Blocks, Relations etc.

Example: Bank database  Small granularity locks: Larger concurrency can achieved.

 Large granularity locks: Some times saves from unserializable behavior.

Warning locks The solution to the problem of managing locks at different granularities involves a new kind of lock called a “Warning.“ • It is helpful in hierarchical or nested structure .

• It involves both “ordinary” locks and “warning” locks.

• Ordinary locks: Shared(S) and Exclusive(X) locks.

• Warning locks: Intention to shared(IS) and Intention to Exclusive(IX) locks.

• Warning Protocols • • • • 1. To place an ordinary S or X lock on any element. we must begin at the root of the hierarchy.

2. If we are at the element that we want to lock, we need look no further. We request lock there only 3. If the element is down in hierarchy then place warning lock on that node respective of shared and exclusive locks and then Move on to appropriate child and then try steps 2 or 3 and until you go to desired node and then request shared or exclusive lock.

IS IX S X IS

YES YES YES NO Compatibility Matrix

IX

YES YES NO NO

S

YES N O YES NO

X

NO NO NO NO IS column: Conflicts only on X lock.

IX column: Conflicts on S and X locks.

S column: Conflicts on X and IX locks.

X column: Conflicts every locks.

Warning Protocols Transaction2(T2):

Consider the relation: M o v i e ( t i t l e , year, length, studioName)

Transaction1 (T1): SELECT * FROM Movie WHERE title = 'King Kong'; UPDATE Movie SET year = 1939 WHERE title = 'Gone With the Wind';

• • • • • • •

18.7 The Tree Protocol

ADVANTAGES OF TREE PROTOCOL

Unlocking takes less time as compared to 2PL Freedom from deadlocks

18.7.1 MOTIVATION FOR TREE-BASED LOCKING

Consider B-Tree Index, treating individual nodes as lockable database elements.

Concurrent use of B-Tree is not possible with standard set of locks and 2PL.

Therefore, a protocol is needed which can assure serializability by allowing access to the elements all the way at the bottom of the tree even if the 2PL is violated

18.7.2 ACCESSING TREE STRUCTURED DATA Assumptions:

• Only one kind of lock • Consistent transactions • Legal schedules • No 2PL requirement on transaction • • • •

Rules:

First lock can be at any node.

Subsequent locks may be acquired only after parent node has a lock.

Nodes may be unlocked any time.

No relocking of the nodes even if the node’s parent is still locked

• • •

18.7.3 WHY TREE PROTOCOL WORKS?

Tree protocol implies a serial order on transactions in the schedule.

Order of precedence:

Ti < s Tj

If Ti locks the root before Tj, then Ti locks every node in common with Tj before Tj.

ORDER OF PRECEDENCE

• • • • •

18.8 Concurrency control by Timestamps What is Timestamping?

Scheduler assign each transaction T a unique number, it’s timestamp TS(T).

Timestamps must be issued in ascending order, at the time when a transaction first notifies the scheduler that it is beginning. Two methods of generating Timestamps.

– Use the value of system, clock as the timestamp.

– Use a logical counter that is incremented after a new timestamp has been assigned.

Scheduler maintains a table of currently active transactions and their timestamps irrespective of the method used

• • • • • •

Timestamps for database element X and commit bit

RT(X):- The read time of X, which is the highest timestamp of transaction that has read X.

WT(X):- The write time of X, which is the highest timestamp of transaction that has write X.

C(X):- The commit bit for X, which is true if and only if the most recent transaction to write X has already committed.

Physically Unrealizable Behavior

Read too late: A transaction U that started after transaction T, but wrote a value for X before T reads X.

U reads X T writes X T start U start

Figure: Transaction T tries to write too late

Physically Unrealizable Behavior

• • • Write too late A transaction U that started after T, but read X before T got a chance to write X.

Dirty Read

It is possible that after T reads the value of X written by U, transaction U will abort.

U reads X T writes X T start U start

Figure: Transaction T tries to write too late

U writes X T reads X U start T start U aborts

T could perform a dirty read if it reads X when shown

Rules for Timestamps-Based scheduling

1. Scheduler receives a request rT(X) a) If TS(T) ≥ WT(X), the read is physically realizable.

1. If C(X) is true, grant the request, if TS(T) > RT(X), set RT(X) := TS(T); otherwise do not change RT(X).

2. If C(X) is false, delay T until C(X) becomes true or transaction that wrote X aborts.

b) If TS(T) < WT(X), the read is physically unrealizable. Rollback T.

2

. Scheduler receives a request WT(X).

a) if TS(T) ≥ RT(X) and TS(T) ≥ WT(X), write is physically realizable and must be performed.

1. Write the new value for X, 2. Set WT(X) := TS(T), and 3. Set C(X) := false.

b) if TS(T) ≥ RT(X) but TS(T) < WT(X), then the write is physically realizable, but there is already a later values in X. a. If C(X) is true, then the previous writers of X is committed, and ignore the write by T.

b. If C(X) is false, we must delay T.

c) if TS(T) < RT(X), then the write is physically unrealizable, and T must be rolled back.

3. Scheduler receives a request to commit T. It must find all the database elements X written by T and set C(X) := true. If any transactions are waiting for X to be committed, these transactions are allowed to proceed.

4. Scheduler receives a request to abort T or decides to rollback T, then any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.

• • • • • • • •

Multiversion Timestamps

Multiversion schemes keep old versions of data item to increase concurrency.

Each successful write results in the creation of a new version of the data item written.

Use timestamps to label versions.

When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.

Timestamps and Locking

Generally, timestamping performs better than locking in situations where: – – Most transactions are read-only.

It is rare that concurrent transaction will try to read and write the same element.

In high-conflict situation, locking performs better than timestamps

Concurrency Control by Validation(18.9)

By Swathi Vegesna 217

At a Glance

 Introduction  Validation based scheduling  Validation based Scheduler  Expected exceptions  Validation rules  Example  Comparisons  Summary

Introduction

What is optimistic concurrency control?

• ( assumes no unserializable behavior will occur ) Timestamp- based scheduling and • Validation-based scheduling ( allows T to access data without locks )

Validation based scheduling

 Scheduler keeps a record of what the active transactions are doing.

 Executes in 3 phases 1. Read reads from RS( ), computes local address 2. Validate compares read and write sets 3. Write writes from WS( )

Validation based Scheduler

 Contains an assumed serial order of transactions.  Maintains three sets: – START( ): set of T’s started but not completed validation.

– VAL( ): set of T’s validated but not finished the writing phase.

– FIN( ): set of T’s that have finished.

Expected exceptions

1. Suppose there is a transaction U, such that:  U is in VAL or FIN; that is, U has validated,   FIN(U)>START(T); that is, U did not finish before T started RS(T)

WS(T) ≠φ; let it contain database element X.

• 2. Suppose there is transaction U, such that: U is in VAL; U has successfully validated.

• FIN(U)>VAL(T); U did not finish before T entered its validation phase.

• WS(T)

WS(U) ≠φ; let x be in both write sets.

Validation rules

 Check that RS(T)

WS(U)= φ for any previously validated U that did not finish before T has started i.e. FIN(U)>START(T).

 Check that WS(T)

WS(U)= φ for any previously validated U that did not finish before T is validated i.e. FIN(U)>VAL(T)

Example

Solution

    Validation of U: Nothing to check Validation of T: WS(U) ∩ RS(T)= {D} ∩{A,B}=φ WS(U) ∩ WS(T)= {D}∩ {A,C}=φ Validation of V: RS(V) ∩ WS(T)= {B}∩{A,C}=φ WS(V) ∩ WS(T)={D,E}∩ {A,C}=φ RS(V) ∩ WS(U)={B} ∩{D}=φ Validation of W: RS(W) ∩ WS(T)= {A,D}∩{A,C}={A} WS(W) ∩ WS(V)= {A,D}∩{D,E}={D} WS(W) ∩ WS(V)= {A,C}∩{D,E}=φ (W is not validated)

Comparison

Concurrency control Mechanisms

Locks Timestamps Validation

Storage Utilization

Space in the lock table is proportional to the number of database elements locked.

Space is needed for read and write times with every database element, neither or not it is currently accessed.

Delays

Delays transactions but avoids rollbacks Do not delay the transactions but cause them to rollback unless Interface is low Space is used for timestamps and read or write sets for each currently active transaction, plus a few more transactions that finished after some currently active transaction began.

Do not delay the transactions but cause them to rollback unless interface is low

21.1 Introduction to Information Integration

CS257 Fan Yang

Need for Information Integration

• • • All the data in the world could put in a single database (ideal database system) Databases In are created independently hard to design a database to support future use The use of databases evolves, so we can not design a database to support every possible future use.

University Database

• • • • Registrar: to record student and grade Bursar: to record tuition payments by students Human Resources Department: to record employees Applications were build using these databases like generation of payroll checks, calculation of taxes and social security payments to government.

Inconvenient

• • • • change in 1 database would not reflect in the other database which had to be performed manually.

Record grades for students who pay tuition Want to swim in SJSU aquatic center for free in summer vacation?

(all the cases above cannot achieve the function by a single database) Solution: one database

How to integrate

• Start over build one database: contains all the legacy databases; rewrite all the applications • result: painful Build a layer of abstraction (middleware) on top of all the legacy databases this layer is often defined by a collection of classes BUT…

• When we try to connect information sources that were developed independently, we invariably find that sources differ in many ways. Such sources are called Heterogeneous, and the problem of integrating them is referred to as the Heterogeneity Proble

m.

Heterogeneity Problem

• What is Heterogeneity Problem Aardvark Automobile Co. 1000 dealers has 1000 databases to find a model at another dealer can we use this command: SELECT * FROM CARS WHERE MODEL=“A6”;

Type of Heterogeneity

• • • • • • Communication Heterogeneity Query-Language Heterogeneity Schema Heterogeneity Data type difference Value Heterogeneity Semantic Heterogeneity

Communication Heterogeneity

• • Today, it is common to allow access to your information using HTTP protocols. However, some dealers may not make their databases available on net, but instead accept remote accesses via anonymous FTP.

Suppose there are 1000 dealers of Aardvark Automobile Co. out of which 900 use HTTP while the remaining 100 use FTP, so there might be problems of communication between the dealers databases.

Query Language Heterogeneity

• • The manner in which we query or modify a dealer’s database may vary.

For e.g. Some of the dealers may have different versions of database like some might use relational database some might not have relational database, or some of the dealers might be using SQL, some might be using Excel spreadsheets or some other database.

Schema Heterogeneity

• • Even assuming that the dealers use a relational DBMS supporting SQL as the query language there might be still some heterogeneity at the highest level like schemas can differ.

For e.g. one dealer might store cars in a single relation while the other dealer might use a schema in which options are separated out into a second relation.

Data type Diffrences

• Serial Numbers might be represented by a character strings of varying length at one source and fixed length at another. The fixed lengths could differ, and some sources might use integers rather than character strings.

Value Heterogeneity

• The same concept might be represented by different constants at different sources. The color Black might be represented by an integer code at one source, the string BLACK at another, and the code BL at a third.

Semantic Heterogeneity

• Terms might be given different interpretations at different sources. One dealer might include trucks in

Cars

relation, while the another puts only automobile data in Cars relation. One dealer might distinguish station wagons from the minivans, while another doesn’t.

Chapter 21.2

Modes of Information Integration

ID: 219 Name: Qun Yu Class: CS257 219 Spring 2009 Instructor: Dr. T.Y.Lin

Federations

 The simplest architecture for integrating several DBs  One to one connections between all pairs of DBs  n DBs talk to each other, n(n-1) wrappers are needed  Good when communications between DBs are limited

Wrapper

• –

Wrapper

: a software translates incoming queries and outgoing answers. allows information sources to conform to some shared schema.

Federations Diagram

DB1 DB2 2 Wrappers 2 Wrappers 2 Wrappers 2 Wrappers 2 Wrappers 2 Wrappers DB3 DB4

A federated collection of 4 DBs needs 12 components to translate queries from one to another.

Example

Car dealers want to share their inventory. Each dealer queries the other’s DB to find the needed car.

Dealer-1’s DB relation: NeededCars(model,color,autoTrans) Dealer-2’s DB relation: Auto(Serial, model, color) Options(serial,option) Dealer-1’s DB wrapper wrapper Dealer-2’s DB

Example…

} For(each tuple(:m,:c,:a) in NeededCars){ if(:a=TRUE){/* automatic transmission wanted */ SELECT serial FROM Autos, Options WHERE Autos.serial = Options.serial AND Options.option = ‘autoTrans’ AND Autos.model = :m AND Autos.color =:c; } Else{/* automatic transmission not wanted */ SELECT serial FROM Auto } WHERE Autos.model = :m AND Autos.color = :c AND NOT EXISTS( SELECT * FROM Options WHERE serial = Autos.serial AND option=‘autoTrans’);

Dealer 1 queries Dealer 2 for needed cars

Data Warehouse

   Sources are translated from their local schema to a global schema and copied to a central DB.

User transparent: user uses Data Warehouse just like an ordinary DB User is not allowed to update Data Warehouse

Warehouse Diagram

User query result Warehouse Combiner Extractor Source 1 Extractor Source 2

Example

Construct a data warehouse from sources DB of 2 car dealers: Dealer-1’s schema: Cars(serialNo, model,color,autoTrans,cdPlayer,…) Dealer-2’s schema: Auto(serial,model,color) Options(serial,option) Warehouse’s schema: AutoWhse(serialNo,model,color,autoTrans,dealer) Extractor --- Query to extract data from Dealer-1’s data: INSERT INTO AutosWhse(serialNo, model, color, autoTans, dealer) SELECT serialNo,model,color,autoTrans,’dealer1’ From Cars;

Example

Extractor --- Query to extract data from Dealer-2’s data: INSERT INTO AutosWhse(serialNo, model, color, autoTans, dealer) SELECT serialNo,model,color,’yes’,’dealer2’ FROM Autos,Options WHERE Autos.serial=Options.serial AND option=‘autoTrans’; INSERT INTO AutosWhse(serialNo, model, color, autoTans, dealer) SELECT serialNo,model,color,’no’,’dealer2’ FROM Autos WHERE NOT EXISTS ( SELECT * FROM serial =Autos.serial AND option = ‘autoTrans’);

Construct Data Warehouse

1) There are mainly 3 ways to constructing the data in the warehouse: Periodically reconstructed from the current data in the sources, once a night or at even longer intervals.

Advantages: simple algorithms.

Disadvantages: 1) need to shut down the warehouse; 2) data can become out of date.

Construct Data Warehouse

2) Updated periodically based on the changes(i.e. each night) of the sources.

Advantages: involve smaller amounts of data.

(important when warehouse is large and needs to be modified in a short period) Disadvantages: 1) the process to calculate changes to the warehouse is complex.

2) data can become out of date.

Construct Data Warehouse

3) Changed immediately, in response to each change or a small set of changes at one or more of the sources.

Advantages: data won’t become out of date. Disadvantages: requires too much communication, therefore, it is generally too expensive.

(practical for warehouses whose underlying sources changes slowly.)

Mediators

   Virtual warehouse, which supports a virtual view or a collection of views, that integrates several sources.

Mediator doesn’t store any data.

Mediators’ tasks: 1)receive user’s query, 2)send queries to wrappers, 3)combine results from wrappers, 4)send the final result to user.

A Mediator diagram

User query Result Query Query Wrapper Result Mediator Result Result Query Query Wrapper Result Source 1 Source 2

Example

Same data sources as the example of data warehouse, the mediator Integrates the same two dealers’ source into a view with schema: AutoMed(serialNo,model,color,autoTrans,dealer) When the user have a query: SELECT sericalNo, model FROM AkutoMed Where color=‘red’ In this simple case, the mediator forwards the same query to each Of the two wrappers.

Wrapper1: FROM cars Cars(serialNo, model, color, autoTrans, cdPlayer, …) SELECT serialNo,model WHERE color = ‘red’; Wrapper2: Autos(serial,model,color); Options(serial,option) SELECT serial , model FROM Autos WHERE color=‘red’;

Example

There may be different options for the mediator to forward user query, for example, the user queries if there are a specific model&color car (i.e. “Gobi”, “blue”).

The mediator decides 2 nd query is needed or not based on the result of 1 st query. That is, If dealer-1 has the specific car, the mediator doesn’t have to query dealer-2.

Chapter 21 Information Integration

21.3 Wrappers in Mediator-Based Systems Presented by: Kai Zhu Professor: Dr. T.Y. Lin Class ID: 220

Wrappers in Mediator-based Systems  More complicated than that in most data warehouse system.

 Able to accept a variety of queries from the mediator and translate them to the terms of the source.

 Communicate the result to the mediator.

How to design a wrapper?

Classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants.

Templates for Query Patterns:  Use notation T=>S to express the idea that the template T is turned by the wrapper into the source query S.

• Example 1 Dealer 1 Cars (serialNo, model, color, autoTrans, navi, … ) For use by a mediator with schema AutoMed (serialNo, model, color, autoTrans, dealer)

• We denote the code representing that color by the parameter $c , then the template will be: SELECT * FROM AutosMed WHERE color = ’ $c ’ ; => SELECT serialNo, model, color, autoTrans, ’ dealer1 ’ FROM Cars WHERE color= ’ $c ’ ; (Template T => Source query S) There will be total 2 n templates if we have the option of specifying n attributes.

• •

Wrapper Generators

The wrapper generator creates a table holds the various query patterns contained in the templates.

The source queries that are associated with each.

A driver is used in each wrapper, the task of the driver is to:  Accept a query from the mediator.

 Search the table for a template that matches the query.

 The source query is sent to the source, again using a “ plug-in ” communication mechanism.

 The response is processed by the wrapper.

• Filter Have a wrapper filter to supporting more queries.

• • Example 2 If wrapper is designed with more complicated template with queries specify both model and color.

SELECT * FROM AutosMed WHERE model = ’ $m ’ AND color = ’ $c ’ ; => SELECT serialNo, model, color, autoTrans, ’ dealer1 ’ FROM Cars WHERE model = ’ $m ’ AND color= ’ $c ’ ; • Now we suppose the only template we have is color. However the wrapper is asked by the Mediator to find “ blue Gobi model car.

Solution: 1. Use template with $c= ‘ blue ’ find all blue cars and store them in a temporary relation: TemAutos (serialNo, model, color, autoTrans, dealer) 2.The wrapper then return to the mediator the desired set of automobiles by excuting the local query: SELECT* FROM TemAutos WHERE model= ’ Gobi ’ ;

INFORMATION INTEGRATION

Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin

Sections 21.4 – 21.5

21.4 Capability Based Optimization

• Introduction – Typical DBMS estimates the cost of each query plan and picks what it believes to be the best – Mediator – has knowledge of how long its sources will take to answer – Optimization of mediator queries cannot rely on cost measure alone to select a query plan – Optimization by mediator follows capability based optimization

21.4.1 The Problem of Limited Source Capabilities

• • • • Many sources have only Web Based interfaces Web sources usually allow querying through a query form E.g. Amazon.com interface allows us to query about books in many different ways.

But we cannot ask questions that are too general – E.g. Select * from books;

21.4.1 The Problem of Limited Source Capabilities (con’t)

• Reasons why a source may limit the ways in which queries can be asked – Earliest database did not use relational DBMS that supports SQL queries – Indexes on large database may make certain queries feasible, while others are too expensive to execute – Security reasons • E.g. Medical database may answer queries about averages, but won’t disclose details of a particular patient's information

21.4.2 A Notation for Describing Source Capabilities

 For relational data, the legal forms of queries are described by adornments  Adornments – Sequences of codes that represent the requirements for the attributes of the relation, in their standard order  f(free) – attribute can be specified or not  b(bound) – must specify a value for an attribute but any value is allowed  u(unspecified) – not permitted to specify a value for a attribute

21.4.2 A notation for Describing Source Capabilities….(cont’d)

c[S](choice from set S) means that a value must be specified and value must be from finite set S.

o[S](optional from set S) means either do not specify a value or we specify a value from finite set S  A prime (f’) specifies that an attribute is not a part of the output of the query  A capabilities specification is a set of adornments  A query must match one of the adornments in its capabilities specification

21.4.2 A notation for Describing Source Capabilities….(cont’d)

 E.g. Dealer 1 is a source of data in the form: Cars (serialNo, model, color, autoTrans, navi) The adornment for this query form is b’uuuu

21.4.3 Capability-Based Query-Plan Selection

• • Given a query at the mediator, a capability based query optimizer first considers what queries it can ask at the sources to help answer the query The process is repeated until: – Enough queries are asked at the sources to resolve all the conditions of the mediator query and therefore query is answered. Such a plan is called feasible.

– We can construct no more valid forms of source queries, yet still cannot answer the mediator query. It has been an impossible query .

21.4.3 Capability-Based Query-Plan Selection (cont’d)

• • The simplest form of mediator query where we need to apply the above strategy is join relations E.g we have sources for dealer 2 – Autos(serial, model, color) – Options(serial, option) • Suppose that ubf is the sole adornment for Auto and Options have two adornments, bu and uc[autoTrans, navi] • Query is – find the serial numbers and colors of Gobi models with a navigation system

21.4.4 Adding Cost-Based Optimization

• • • • Mediator’s Query optimizer is not done when the capabilities of the sources are examined Having found feasible plans, it must choose among them Making an intelligent, cost based query optimization requires that the mediator knows a great deal about the costs of queries involved Sources are independent of the mediator, so it is difficult to estimate the cost

21.5 Optimizing Mediator Queries

• Chain algorithm – a greedy algorithm – answers the query by sending a sequence of requests to its sources . – Will always find a solution assuming at least one solution exists. – The solution may not be optimal.

21.5.1 Simplified Adornment Notation

• • A query at the mediator is limited to b (bound) and f (free) adornments. We use the following convention for describing adornments: – name adornments (attributes) – where: • • name is the name of the relation the number of adornments = the number of attributes

21.5.2 Obtaining Answers for Subgoals

• Rules for subgoals and sources: – Suppose we have the following subgoal:

R x

1

x

2 …x

n

(a 1 , a 2 , …, a

n

), and source adornments for R are: y 1

y

2 …y n . • If y

i

is b or c[S], then x

i

= b. • If x

i

= f, then y

i

is not output restricted. – The adornment on the subgoal matches the adornment at the source: • If y

i

is f, u, or o[S] and x

i

is either b or f.

21.5.3 The Chain Algorithm

• • • Maintains 2 types of information: – An adornment for each subgoal. – A relation X that is the join of the relations for all the subgoals that have been resolved. The adornment for a subgoal is b if the mediator query provides a constant binding for the corresponding argument of that subgoal. X is a relation over no attributes, containing just an empty tuple .

21.5.3 The Chain Algorithm (con’t)

 First, initialize adornments of subgoals and X.  Then, repeatedly select a subgoal that can be resolved. Let R

α

(a 1 , a 2 , …, a

n

) be the subgoal: 1.

Wherever α has a b, we shall find the argument in R is a constant, or a variable in the schema of R.  Project X onto its variables that appear in R.

21.5.3 The Chain Algorithm (con’t)

2. For each tuple t in the project of X, issue a query to the source as follows (

β

is a source – adornment). If

β

has b, then the corresponding component of α has b, and we can use the corresponding component of t for source query. – – If

β ha

s c[S], and the corresponding component of t is in S, then the corresponding component of α has b, and we can use the corresponding component of t for the source query. If

β

has f, and the corresponding component of α constant value for source query. is b, provide a

21.5.3 The Chain Algorithm (con’t)

– – – If a component of

β

is u, then provide no binding for this component in the source query. If a component of

β

is o[S], and the corresponding component of α is f, then treat it as if it was a f. If a component of

β

is o[S], and the corresponding component of α b, then treat it as if it was c[S]. is 3. Every variable among a 1 , a 2 , …, a

n

is now bound. For each remaining unresolved subgoal, change its adornment so any position holding one of these variables is b.

21.5.3 The Chain Algorithm (con’t)

4. Replace X with X π s(R), where S is all of the variables among: a 1 , a 2 , …, a

n

. 5. Project out of X all components that • in the head or in any unresolved subgoal. If every subgoal is resolved, then X is the answer. Else the algorithm fails

21.5.3 The Chain Algorithm Example

• Mediator query: – Q: Answer(c) ←

R bf

(

1

,

a

) AND

S ff

(

a

,

b

) AND

T ff

(

b

,

c

) • Example: Relation

w

R Data 1 1 1

x

2 3 4

x

2 3 S

y

4 5

y

4 5 5 T

z

6 7 8 Adornment

bf

c’[2,3,5]f

bu

21.5.3 The Chain Algorithm Example (con’t)

• • • Initially, the adornments on the subgoals are the same as Q, and X contains an empty tuple. – S and T cannot be resolved as they each have ff adornments, but the sources have either a, b or c. R(1,a) can be resolved because its adornments are matched by the source’s adornments. Send R(w,x) with w=1 to get the tables on the previous page.

21.5.3 The Chain Algorithm Example (con’t)

Project the subgoal’s relation onto its second component, since only the second component of R(1,a) is a variable.

a

2 3 4 • • This is joined with X, resulting in X equaling this relation. Change adornment on S from ff to bf.

21.5.3 The Chain Algorithm Example (con’t)

• • Now we resolve S

bf

(a,b): – Project X onto a, resulting in X.

– Now, search S for tuples with attribute a equivalent to attribute a in X.

a b

2 3 4 5 Join this relation with X, and remove a as it doesn’t appear in the head nor any unresolved subgoal:

b

4 5

21.5.3 The Chain Algorithm Example (con’t)

• Now we resolve T

bf

(b,c):

b c

4 5 5 6 7 8 • • Join this relation with X and project onto the c attribute to get the relation for the head. Solution is {(6), (7), (8)}.

21.5.4 Incorporating Union Views at the Mediator

• • • This implementation of the Chain Algorithm does not consider that several sources can contribute tuples to a relation. If specific sources have tuples to contribute that other sources may not have, it adds complexity. To resolve this, we can consult all sources, or make best efforts to return all the answers.

21.5.4 Incorporating Union Views at the Mediator (con’t)

• • Consulting All Sources – We can only resolve a subgoal when each source for its relation has an adornment matched by the current adornment of the subgoal. – Less practical because it makes queries harder to answer and impossible if any source is down. Best Efforts – We need only 1 source with a matching adornment to resolve a subgoal. – Need to modify chain algorithm to revisit each subgoal when that subgoal has new bound requirements.

Local-as-View Mediators

Priya Gangaraju(Class Id:203)

Local-as-View Mediators.

• In a LAV mediator, global predicates defined are not views of the source data.

Expressions are defined for each source with global predicates that describe tuples that source produces

Mediator answers the queries by constructing the views as provided by the source.

Motivation for LAV Mediators

• • •

Relationship between the data provided by the mediator and the sources is more subtle

For example, consider the predicate Par(c, p) meaning that p is a parent of c which represents the set of all child parent facts that could ever exist.

The sources will provide information about whatever child-parent facts they know.

Motivation(contd..)

• • • • There can be sources which may provide child grandparent facts but not child- parent facts at all.

This source can never be used to answer the child-parent query under GAV mediators.

LAV mediators allow to say that a certain source provides grand parent facts.

Used to discover how and when to use the source in a given query.

Terminology for LAV Mediation.

• • • • • •

The queries at mediator and those describing the source will be single Datalog rules A single Datalog rule is called a conjunctive query The global predicates of LAV mediator are used as subgoals of mediator queries.

Conjunctive queries define views. Their heads each have a unique view predicate that is name of a view.

Each view definition consists of global predicates and is associated with a particular source.

Each view is constructed with an all-free adornment.

Example..

• • • Consider global predicate Par(c, p) meaning that p is a parent of c. One source produces parent facts. Its view is defined by the conjunctive query V 1 (c, p)  Par(c, p) Another source produces some grand parents facts. Then its conjunctive query will be – V 2 (c, g)  Par(c, p) AND Par(p, g)

Example contd..

The query at mediator will ask for great-grand parent facts to be obtained from sources: Q(w, z)

Par(w, x) AND Par(x, y) AND Par(y, z)

• • One solution can be using the parent predicate(V 1 ) directly three times.

Q(w, z)  V 1 (w, x) AND V 1 (x, y) AND V 1 (y, z) Another solution can be to use V 1 (parent facts) and V 2 (grandparent facts).

Q(w, z)  V 1 (w, x) AND V 2 (x, z) Or Q(w, z)  V 2 (w, y) AND V 1 (y, z)

Expanding Solutions.

• • Consider a query Q, a solution S that has a body whose subgoals are views and each view V is defined by a conjunctive query with that view as the head.

The body of V’s conjunctive query can be substituted for a subgoal in S that uses the predicate V to have a body consisting of only global predicates.

Expansion Algorithm

• • A solution S has a subgoal V(a 1 , a 2 ,…an) where a i ’s can be any variables or constants.

The view V can be of the form • V(b 1 , b 2 ,….b

n )  B Where B represents the entire body.

V(a 1 , a 2 , … a n ) can be replaced in solution S by a version of body B that has all the subgoals of B with variables possibly altered.

Expansion Algorithm contd..

 The rules for altering the variables of B are: 1. First identify the local variables B, variables that appear in the body but not in the head.

2. If there are any local variables of B that appear in B or in S, replace each one by a distinct new variable that appears nowhere in the rule for V or in S.

3. In the body B, replace each b i 1,2…n.

by a i for i =

Example.

• • • Consider the view definitions, V 1 (c, p)  V 2 (c, g)  Par(c, p) Par(c, p) AND Par(p, g) One of the proposed solutions S is Q(w, z)  V 1 (w, x) AND V 2 (x, z) The first subgoal with predicate V 1 in the solution can be expanded as Par(w, x) as there are no local variables.

Example Contd.

• • • The V2 subgoal has a local variable p which doesn’t appear in S nor it has been used as a local variable in another substitution. So p can be left as it is.

Only x and z are to be substituted for variables c and g.

The Solution S now will be Q(w, z)  Par(w, x) AND Par(x, p) AND Par(p,z)

Containment of Conjunctive Queries

 A containment mapping from Q to E is a function т from the variables of Q to the variables and constants of E, such that: 1.

If x is the ith argument of the head of Q, then т(x) is the ith argument of the head of E.

2. Add to т the rule that т(c)=c for any constant c. If P(x 1 ,x 2 ,… x n ) is a subgoal of Q, then P(т(x 1 ), т(x 2 ),… т(x n )) is a subgoal of E.

Example.

• Consider two Conjunctive queries: Q 1 : H(x, y)  A(x, z) and B(z, y) • Q 2 : H(a, b)  A(a, c) AND B(d, b) AND A(a, d) When we apply the substitution, Т(x) = a, Т(y) = b, Т(z) = d, the head of Q 1 becomes H(a, b) which is the head of Q 2 .

So,there is a containment mapping from Q 1 Q 2 .

to

Example contd..

• • • The first subgoal of Q 1 becomes A(a, d) which is the third subgoal of Q 2 .

The second subgoal of Q 1 subgoal of Q 2 .

becomes the second There is also a containment mapping from Q 2 to Q 1 so the two conjunctive queries are equivalent.

Why the Containment-Mapping Test Works

• • • • Suppose there is a containment mapping т from Q 1 to Q 2 .

When Q 2 is applied to the database, we look for substitutions σ for all the variables of Q 2 .

The substitution for the head becomes a tuple t that is returned by Q 2 .

If we compose т and then σ, we have a mapping from the variables of Q Q 1 .

1 to tuples of the database that produces the same tuple t for the head of

Finding Solutions to a Mediator Query

 There can be infinite number of solutions built from the views using any number of subgoals and variables.

 • LMSS Theorem can limit the search which states that If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals.

 If the conjunctive query that defines a view V has in its body a predicate P that doesn’t appear in the body of the mediator query, then we need not consider any solution that uses V.

Example.

• • Recall the query Q1: Q(w, z)  Par(w, x) AND Par(x, y) AND Par(y, z) This query has three subgoals, so we don’t have to look at solutions with more than three subgoals.

Why the LMSS Theorem Holds

• • • Suppose we have a query Q with n subgoals and there is a solution S with more than n subgoals.

The expansion E of S must be contained in Query Q, which means that there is a containment mapping from Q to E.

We remove from S all subgoals whose expansion was not the target of one of Q’s subgoals under the containment mapping.

Contd..

• • • • We would have a new conjunctive query S’ with at most n subgoals.

If E’ is the expansion of S’ then, E’ is a subset of Q.

S is a subset of S’ as there is an identity mapping.

Thus S need not be among the solutions to query Q.

Query Compiler

By:Payal Gupta Roll No:106(225)

Professor :Tsau Young Lin

Pushing Selections

• • It is, replacing the left side of one of the rules by its right side.

In pushing selections we first a selection as far up the tree as it would go, and then push the selections down all possible branches.

• • • Let’s take an example: S t a r s I n ( t i t l e , year, starName) Movie(title, year, length, incolor, studioName, producerC#) • Define view MoviesOf 1996 by: CREATE VIEW MoviesOfl996 AS SELECT * FROM Movie ,WHERE year = 1996;

• "which stars worked for which studios in 1996?“ can be given by a SQL Query: SELECT starName, studioName FROM MoviesOfl996 NATURAL JOIN StarsIn;

ΠstarName,studioName

O Year=1996 StarsIn

Movie

Logical query plan constructed from definition of a query and view

Improving the query plan by moving selections up and down the tree

ΠstarName,studioName

O Year=1996 Movie

O

Year=1996 StarsIn

Laws Involving Projection

• • • "pushing" projections really involves introducing a new projection somewhere below an existing projection.

projection keeps the number of tuples the same and only reduces the length of tuples.

To describe the transformations of extended projection Consider a term E + x on the list for a projection, where E is an attribute or an expression involving attributes and constants and x is an output attribute.

Example

• Let R(a, b, c) and S(c, d, e) be two relations. Consider the expression x,+,,,, b+y(R w S). The input attributes of the projection are a,b, and e, and c is the only join attribute. We may apply the law for pushing projections below joins to get the equivalent expression:

Πa+e->x,b->y(Πa,b,c(R) Πc,e(S))

• Eliminating this projection and getting a third equivalent expression: Πa+e->x, b->y( R Πc,e(S))

• In addition, we can perform a projection entirely before a bag union. That is:

ΠL(R UB S)= ΠL(R) )UB ΠL(S)

Laws About Joins and Products

• laws that follow directly from the definition of the join: • • R S = ΠL( c ( R * S) ) , where C is the condition that equates each pair of attributes from R and S with the same name. and L is a list that includes one attribute from each equated pair and all the other attributes of R and S.

We identify a product followed by a selection as a join of some kind.

Laws Involving Duplicate Elimination

• • • The operator δ which eliminates duplicates from a bag can be pushed through many but not all operators. In general, moving a δ down the tree reduces the size of intermediate relations and may therefore beneficial. Moreover, sometimes we can move δ to a position where it can be eliminated altogether,because it is applied to a relation that is known not to possess duplicates.

• δ (R)=R if R has no duplicates. Important cases of such a relation R include: a) A stored relation with a declared primary key, and b) A relation that is the result of a γ operation, since grouping creates a relation with no duplicates.

• • • • • Several laws that "push" δ through other operators are: δ (R*S) =δ(R) * δ(S) δ (R S)=δ(R) δ(S) δ (R c S)=δ(R) c δ(S) • • We can also move the δ to either or both of the arguments of an intersection: δ (R ∩ B S) = δ(R) ∩ B S = R ∩ B δ (S) = δ(R) ∩ B δ (S)

Laws Involving Grouping and Aggregation

• • When we consider the operator γ, we find that the applicability of many transformations depends on the details of the aggregate operators used. Thus we cannot state laws in the generality that we used for the other operators. One exception is that a γ absorbs a δ . Precisely: δ(γ L (R))=γ L (R)

• let us call an operator γ duplicate-impervious if the only aggregations in L are MIN and/or MAX then: • γ L(R) = γ L (δ(R)) provided γL is duplicate-

impervious.

Example

• Suppose we have the relations MovieStar(name , addr , gender, birthdate) StarsIn(movieTitle, movieyear, starname) and we want to know for each year the birthdate of the youngest star to appear in a movie that year. We can express this query as: SELECT movieyear, MAX(birth date) FROM MovieStar, StarsIn WHERE name = starName GROUP BY movieyear;

γ movieYear, MAX ( birthdate )

O name = starName

MovieStar StarsIn

Initial logical query plan for the query

• Some transformations that we can apply to Fig are 1. Combine the selection and product into an equijoin.

2.Generate a δ below the γ , since the γ is duplicate impervious.

3. Generate a Π between the γ and the introduced δ to project onto movie-Year and birthdate, the only attributes relevant to the γ

γ movieYear, MAX ( birthdate )

Π movieYear, birthdate

δ

name = starName

MovieStar StarsIn

Another query plan for the query

γ movieYear, MAX ( birthdate )

Π movieYear, birthdate

δ

name = starName

δ

Π birthdate,name Π movieYear,starname

MovieStar StarsIn

third query plan for Example

The Query Compiler

Section 16.3

DATABASE SYSTEMS – The Complete Book Presented By: Deepti Kundu Under the supervision of: Dr. T.Y.Lin

Review

Query Parser Preprocessor Logical query plan generator Query Rewriter Preferred logical query plan Section 16.1

Section 16.3

Two steps to turn Parse tree into Preferred Logical Query Plan

Replace the nodes and structures of the parse tree, in appropriate groups, by an operator or operators of relational algebra.

Take the relational algebra expression and turn it into an expression that we expect can be converted to the most efficient physical query plan.

Reference Relations

• •

StarsIn (movieTitle, movieYear, starName) MovieStar (name, address, gender, birthdate)

Conversion to Relational Algebra

If we have a with a that has no subqueries, then we may replace the entire construct – the select-list, from-list, and condition – by a relational-algebra expression.

• •

The relational-algebra expression consists of the following from bottom to top:

The products of all the relations mentioned in the , which Is the argument of:

A selection σ C, where C is the expression in the construct being replaced, which in turn is the argument of: A projection π L , where L is the list of attributes in the Example: SELECT movieTitle FROM Starsin, MovieStar WHERE starName = name AND birthdate LIKE ‘%1960’;

SELECT movieTitle FROM Starsin, MovieStar WHERE starName = name AND birthdate LIKE ‘%1960’;

Translation to an algebraic expression tree

Removing Subqueries From Conditions

• • •

For parse trees with a that has a subquery Intermediate operator – two argument selection It is intermediate in between the syntactic categories of the parse tree and the relational algebra operators that apply to relations.

Using a two-argument σ

StarsIn π movieTitle σ IN π name σ birthdate LIKE ‘%1960' starName MovieStar

Two argument selection with condition involving IN

Now say we have, two arguments – some relation and the second argument is a of the form t IN S.

• •

‘t’ – tuple composed of some attributes of R ‘S’ – uncorrelated subquery

1.

2.

3.

Steps to be followed:

Replace the by the tree that is the expression for S ( δ is used to remove duplicates) Replace the two-argument selection by a one-argument selection σ

C.

Give σ

C

an argument that is the product of R and S.

Two argument selection with condition involving IN

R σ t

IN S R

σ

C X

δ

S

The effect

Improving the Logical Query Plan

Algebraic laws to improve logical query plans:

Selections can be pushed down the expression tree as far as they can go.

Similarly, projections can be pushed down the tree, or new projections can be added.

Duplicate eliminations can sometimes be removed, or moved to a more convenient position in the tree.

Certain selections can be combined with a product below to turn the pair of operations into an equijoin.

Grouping Associative/ Commutative Operators

• • • •

An operator that is associative and commutative operators may be though of as having any number of operands.

We need to reorder these operands so that the multiway join is executed as sequence of binary joins.

Its more time consuming to execute them in the order suggested by parse tree.

For each portion of subtree that consists of nodes with the same associative and commutative operator (natural join, union, and intersection), we group the nodes with these operators into a single node with many children.

The effect of query rewriting

StarsIn

Π movieTitle Starname = name σ birthdate LIKE ‘%1960 ’

MovieStar

Final step in producing logical query plan

U U V R S U T W

=>

R U S U T V W

An Example to summarize

• •

“find movies where the average age of the stars was at most 40 when the movie was made” SELECT distinct m1.movieTitle, m1,movieYear FROM StarsIn m1 WHERE m1.movieYear – 40 <= ( SELECT AVG (birthdate) FROM StartsIn m2, MovieStar s WHERE m2.starName = s.name AND m1.movieTitle = m2.movieTitle AND m1.movieYear = m2.movieyear

);

SELECT distinct m1.movieTitle, m1,movieYear FROM StarsIn m1 WHERE m1.movieYear – 40 <= ( SELECT AVG (birthdate) FROM StartsIn m2, MovieStar s WHERE m2.starName = s.name AND m1.movieTitle = m2.movieTitle AND m1.movieYear = m2.movieyear );

Selections combined with a product to turn the pair of operations into an equijoin…

Condition pushed up the expression tree…

`

Selections combined…

16.4 Estimating the Cost of Operations Project Guide Dr. T. Y. Lin San Jose State University Prepared By Akshay Shenoy Computer Science Dept

Estimating the Cost of Operations

• • • After getting to the logical query plan, we turn it into physical plan.

Consider all the possible physical plan and estimate their costs – this evaluation is known as

cost-based enumeration.

The one with least estimated cost is the one selected to be passed to the query-execution engine.

Physical Plan For each physical plan select

• • • • An order and grouping for associative-and commutative operations like joins, unions. An Algorithm for each operator in the logical plan.

eg: whether nested loop join or hash join to be

used

Additional operators that are needed for the physical plan but that were not present explicitly in the logical plan. eg: scanning, sorting The way in which arguments are passed from one operator to the next.

Estimating Sizes of Intermediate Relations

Rules for estimating the number of tuples in an intermediate relation: 1. Give accurate estimates 2. Are easy to compute • 3. Are logically consistent Objective of estimation is to select best physical plan with least cost.

Estimating the Size of a Projection

We should treat a classical, duplicate-eliminating projection as a bag-projection.

The projection is different from the other operators, in that the size of the result is computable. Since a projection produces a result tuple for every argument tuple, the only change in the output size is the change in the lengths of the tuples .

Estimating the Size of a Selection(1)

• • • While performing selection, we may reduce the number of tuples but the sizes of tuple remain same.

S

 

A

c

(

R

) R and C is a constant. Then we recommend as an estimate:

T(S) =T(R)/V(R,A)

The rule above surely holds if all values of attribute A occur equally often in the database.

Estimating the Size of a Selection(2)

• • •

S

  (

R

) T(s) is:

T(S) = T(R)/3

We may use T(S)=T(R)(V(R,a) -1 )/ V(R,a) as an estimate. When the selection condition C is the And of several equalities and inequalities, we can treat the selection as a cascade of simple  (

R

) conditions.

Estimating the Size of a Selection(3)

• A less simple, but possibly more accurate estimate of the

S

 

(R) m

which satisfy C2, we would estimate the number of tuples in S as

n

( 1  ( 1 

m

1 /

n

)( 1 

m

2 /

n

)) 1 

m

1  /

m n

/

n

satisfy C2 . The product of these numbers is the fraction of R’s tuples that are not in S, and 1 minus this product is the fraction that are in S.

Estimating the Size of a Join

• two simplifying assumptions: 1. Containment of Value Sets

If R and S are two relations with attribute Y and V(R,Y)<=V(S,Y) then every Y-value of R will be a Y-value of S.

2. Preservation of Value Sets

Join a relation R with another relation S with attribute A in R and not in S

Under these assumptions, we estimate

Natural Joins With Multiple Join Attributes

Of the T(R),T(S) pairs of tuples from R and S, the expected number of pairs that match in both y1 and y2 is: T(R)T(S)/max(V(R,y1), V(S,y1)) max(V(R, y2), V(S, y2)) In general, the following rule can be used to estimate the size of a natural join when there are any number of attributes shared between the two relations.

 multiplying T(R) by T(S) and dividing by the largest of V(R,y) and V(S,y) for each attribute y that is common to R and S.

Joins of Many Relations(1)

• rule for estimating the size of any join Start with the product of the number of tuples in each relation. Then, for each attribute A appearing at least twice, divide by all but the least of V(R,A)’s. We can estimate the number of values that will remain for attribute A after the join. By the preservation-of-value-sets assumption, it is the least of these V(R,A)’s.

Joins of Many Relations(2)

• Based on the two assumptions-containment and preservation of value sets: No matter how we group and order the terms in a natural join of n relations, the estimation of rules, applied to each join individually, yield the same estimate for the size of the result. Moreover, this estimate is the same that we get if we apply the rule for the join of all n relations as a whole.

Estimating Sizes for Other Operations

• • • • • Union: the average of the sum and the larger. Intersection: approach1: take the average of the extremes, which is the half the smaller.

approach2: intersection is an extreme case of the natural join, use the formula

Estimating Sizes for Other Operations

• Difference: T(R)-(1/2)*T(S) • Duplicate Elimination: take the smaller of

a

• Grouping and Aggregation: upper-bound the number of groups by a product of V(R,A)’s, here attribute A ranges over only the grouping attributes of L. An estimate is the smaller of (1/2)*T(R) and this product.

16.5 Introduction to Cost-based plan selection

Amith KC Student Id: 109

• • Whether selecting a logical query plan or constructing a physical query plan from a logical plan, the query optimizer needs to estimate the cost of evaluating certain expressions.

We shall assume that the "cost" of evaluating an expression is approximated well by the number of disk I/O's performed.

The number of disk I/O’s, in turn, is influenced by: 1. The particular logical operators chosen to implement the query, a matter decided when we choose the logical query plan.

2. The sizes of intermediate results (whose estimation we discussed in Section 16.4) 3. The physical operators used to implement logical operators. e.g.. The choice of a one-pass or two-pass join, or the choice to sort or not sort a given relation.

4. The ordering of similar operations, especially joins 5.

The method of passing arguments from one physical operator to the next.

• • • •

Obtaining Estimates for Size Parameter

The formulas of Section 16.4 were predicated on knowing certain important parameters, especially T(R), the number of tuples in a relation R, and V(R, a), the number of different values in the column of relation R for attribute a.

A modern DBMS generally allows the user or administrator explicitly to request the gathering of statistics, such as T(R) and V(R, a). These statistics are then used in subsequent query optimizations to estimate the cost of operations.

By scanning an entire relation R, it is straightforward to count the number of tuples T(R) and also to discover the number of different values V(R, a) for each attribute a.

The number of blocks in which R can fit, B(R), can be estimated either by counting the actual number of blocks used (if R is clustered), or by dividing T(R) by the number of tuples per block

Computation of Statistics

• Periodic re-computation of statistics is the norm in most DBMS's, for several reasons.

– – First, statistics tend not to change radically in a short time. Second, even somewhat inaccurate statistics are useful as long as they are applied consistently to all the plans. – Third, the alternative of keeping statistics up-to-date can make the statistics themselves into a "hot-spot" in the database; because statistics are read frequently, we prefer not to update them frequently too.

• • • • The recomputation of statistics might be triggered automatically after some period of time, or after some number of updates.

However, a database administrator noticing, that poor performing query plans are being selected by the query optimizer on a regular basis, might request the recomputation of statistics in an attempt to rectify the problem.

Computing statistics for an entire relation R can be very expensive, particularly if we compute V(R, a) for each attribute a in the relation.

One common approach is to compute approximate statistics by sampling only a fraction of the data. For example, let us suppose we want to sample a small fraction of the tuples to obtain an estimate for V(R, a).

Heuristics for Reducing the Cost of Logical Query Plans

• • • • One important use of cost estimates for queries or sub queries is in the application of heuristic transformations of the query.

We have already observed previously how certain heuristics applied independent of cost estimates can be expected almost certainly to improve the cost of a logical query plan.

However, there are other points in the query optimization process where estimating the cost both before and after a transformation will allow us to apply a transformation where it appears to reduce cost and avoid the transformation otherwise.

In particular, when the preferred logical query plan is being generated, we may consider a number of optional transformations and the costs before and after.

• • • Because we are estimating the cost of a logical query plan, so we have not yet made decisions about the physical operators that will be used to implement the operators of relational algebra, our cost estimate cannot be based on disk I/Os.

Rather, we estimate the sizes of all intermediate results using the techniques of Section 16.1, and their sum is our heuristic estimate for the cost of the entire logical plan. For example,

• Consider the initial logical query plan of as shown below, δ σ a = 10 R S

• The statistics for the relations R and S be as follows R(a, b) T(R) = 5000 V(R, a) = 50 V(R, b) = 100 S(b, c) T(S) = 2000 V(S, a) = 200 V(S, b) = 100 • To generate a final logical query plan from, we shall insist that the selection be pushed down as far as possible. However, we are not sure whether it makes sense to push the δ below the join or not. Thus, we generate from the two query plans shown in next slide. They differ in whether we have chosen to eliminate duplicates before or after the join.

50 δ 100 σ a = 10 250 δ S 2000 1000 5000 R (a) 100 σ a = 10 500 δ 1000 0 S 2000 5000 R (b)

• • We know how to estimate the size of the result of the selections, we divide T(R) by V(R, a) = 50. We also know how to estimate the size of the joins; we multiply the sizes of the arguments and divide by max(V(R, b), V(S, b)), which is 200.

Approaches to Enumerating Physical Plans

• • • Let us consider the use of cost estimates in the conversion of a logical query plan to a physical query plan.

The baseline approach, called exhaustive, is to consider all combinations of choices (for each of issues like order of joins, physical implementation of operators, and so on). Each possible physical plan is assigned an estimated cost, and the one with the smallest cost is selected.

• There are two broad approaches to exploring the space of possible physical plans: – Top-down: Here, we work down the tree of the logical query plan from the root.

– Bottom-up: For each sub-expression of the logical-query-plan tree, we compute the costs of all possible ways to compute that sub expression. The possibilities and costs for a sub-expression E are computed by considering the options for the sub-expressions for E, and combining them in all possible ways with implementations for the root operator of E.

Branch-and-Bound Plan Enumeration • • This approach, often used in practice, begins by using heuristics to find a good physical plan for the entire logical query plan. Let the cost of this plan be C. Then as we consider other plans for sub-queries, we can eliminate any plan for a sub-query that has a cost greater than C, since that plan for the sub-query could not possibly participate in a plan for the complete query that is better than what we already know. Likewise, if we construct a plan for the complete query that has cost less than C, we replace C by the cost of this better plan in subsequent exploration of the space of physical query plans.

Hill Climbing • • • This approach, in which we really search for a “valley” in the space of physical plans and their costs; starts with a heuristically selected physical plan. We can then make small changes to the plan, e.g., replacing one method for an operator by another, or reordering joins by using the associative and/or commutative laws, to find "nearby" plans that have lower cost. When we find a plan such that no small modification yields a plan of lower cost, we make that plan our chosen physical query plan.

Dynamic Programming • • In this variation of the general bottom-UP strategy, we keep for each sub-expression only the plan of least cost. As we work UP the tree, we consider possible implementations of each node, assuming the best plan for each sub-expression is also used.

Selinger-Style Optimization • This approach improves upon the dynamic-programming approach by keeping for each sub-expression not only the plan of least cost, but certain other plans that have higher cost, yet produce a result that is sorted in an order that may be useful higher up in the expression tree. Examples of such interesting orders are when the result of the sub-expression is sorted on one of: – The attribute(s) specified in a sort (r) operator at the root – – The grouping attribute(s) of a later group-by (γ) operator.

The join attribute(s) of a later join.

Choosing an Order for Joins

Chapter 16.6 by: Chiu Luk ID: 210

Introduction

• • This section focuses on critical problem in cost-based optimization: – Selecting order for natural join of three or more relations Compared to other binary operations, joins take more time and therefore need effective optimization techniques

Significance of Left and Right Join Arguments

• • The argument relations in joins determine the cost of the join The left argument of the join is – Called the build relation – – Assumed to be smaller Stored in main-memory

Significance of Left and Right Join Arguments

• • The right argument of the join is – Called the probe relation – Read a block at a time – Its tuples are matched with those of build relation The join algorithms which distinguish between the arguments are: – One-pass join – – Nested-loop join Index join

Join Trees

• • • • Order of arguments is important for joining two relations Left argument, since stored in main-memory, should be smaller With two relations only two choices of join tree With more than two relations, there are n! ways to order the arguments and therefore n! join trees, where n is the no. of relations

Join Trees

• Total # of tree shapes T(n) for n relations given by recurrence: • • • • T(1) = 1 T(2) = 1 T(3) = 2 T(4) = 5 … etc

Left-Deep Join Trees

• Consider 4 relations. Different ways to join them are as follows

• • • • In fig (a) all the right children are leaves. This is a left-deep tree In fig (c) all the left children are leaves. This is a right-deep tree Fig (b) is a bushy tree Considering left-deep trees is advantageous for deciding join orders

Join order

Join order selection – – A1 A2 A3 .. An Left deep join trees An – Ai Dynamic programming • Best plan computed for each subset of relations – Best plan (A1, .., An) = min cost plan of( Best plan(A2, .., An) A1 Best plan(A1, A3, .., An) A2 ….

Best plan(A1, .., An-1)) An

Dynamic Programming to Select a Join Order and Grouping

• • Three choices to pick an order for the join of many relations are: – – Consider all of the relations Consider a subset – Use a heuristic o pick one Dynamic programming is used either to consider all or a subset – Construct a table of costs based on relation size – Remember only the minimum entry which will required to proceed

Dynamic Programming to Select a Join Order and Grouping

• Dynamic programming is used either to consider all or a subset • Construct a table of costs based on relation size • Remember only the minimum entry which will required to proceed

Dynamic Programming to Select a Join Order and Grouping

Dynamic Programming to Select a Join Order and Grouping

Dynamic Programming to Select a Join Order and Grouping

Dynamic Programming to Select a Join Order and Grouping

• • Disadvantage of dynamic programming is that it does not involve the actual costs of the joins in the calculations Can be improved by considering • Use disk’s I/O for evaluating cost • When computing cost of R1 join R2, since we sum cost of R1 and R2, we must also compute estimates for there sizes

A Greedy Algorithm for Selecting a Join Order

• • • It is expensive to use an exhaustive method like dynamic programming Better approach is to use a join-order heuristic for the query optimization Greedy algorithm is an example of that – Make one decision at a time about order of join and never backtrack on the decisions once made

Query Execution

Chapter 15 Section 15.1

Presented by Khadke, Suvarna CS 257 (Section II) Id 213

295

What is query processing?

A given SQL query is translated by the query processor into a low level execution plan • An execution plan is a program in a functional language: – The physical relational algebra, specific for each DBMS.

• The physical relational algebra extends the relational algebra with: – Primitives to search through the internal data structures of the DBMS

What is a Query Processor

 Group of components of a DBMS that converts a user queries and data-modification commands into a sequence of database operations  It also executes those operations  Must supply detail regarding how the query is to be executed 297

Major parts of Query processor

Query Execution:  The algorithms that manipulate the data of the database.

 Focus on the operations of extended relational algebra . 298

Query Processing Steps

SQL Query PARSER (parsing and semantic checking as in any compiler) Parse tree (~ tree structure representing relational calculus expression) OPTIMIZER (very advanced) Execution plan (annotated relation algebra expression) EXECUTOR (execution plan interpreter) DBMS kernel Data structures

Outline of Query Compilation

Query compilation

Parsing : A parse tree for the query is constructed  Query Rewrite : The parse tree is converted to an initial query plan and transformed into logical query plan (less time)  Physical Plan Generation : Logical Q Plan is converted into physical query plan by selecting algorithms and order of execution of these operator.

300

Basic Steps in Query Processing

Physical-Query-Plan Operators

 Physical operators are implementations of the operator of relational algebra.

 They can also be use in non relational algebra

operators like “scan”

memory which scans tables, that is, bring each tuple of some relation into main 302

Basic Steps in Query Processing

1. Parsing and translation 2. Optimization 3. Evaluation ROUTERS

Basic Steps in Query Processing (Cont.)

– –

Parsing and translation

translate the query into its internal form. This is then translated into relational algebra.

Parser checks syntax, verifies relations Evaluation

The query-execution engine takes a query evaluation plan, executes that plan, and returns the answers to the query

ROUTERS

Scanning Tables

  One of the basic thing we can do in a Physical query plan is to read the entire contents of a relation R.

Variation of this operator involves simple predicate, read only those tuples of the relation R that satisfy the predicate.

305

Scanning Tables

Basic approaches to locate the tuples of a relation R 

Table Scan

 Relation R is stored in secondary memory with its tuples arranged in blocks  It is possible to get the blocks one by one 

Index-Scan

If there is an index on any attribute of Relation R, we can use this index to get all the tuples of Relation R

306

Sorting While Scanning Tables

Number of reasons to sort a relation

 Query could include an ORDER BY clause, requiring that a relation be sorted.

 Algorithms to implement relational algebra operations requires one or both arguments to

be sorted relations.

Physical-query-plan operator sort-scan takes a relation R, attributes on which the sort is to be made, and produces R in that sorted order 307

Computation Model for Physical Operator

 Physical-Plan Operator should be selected wisely which is essential for good Query Processor .

 For “cost” of each operator is estimated by number of disk I/O’s for an operation.

 The total cost of operation depends on the size of the answer, and includes the final write back cost to the total cost of the query.

308

Parameters for Measuring Costs

 Parameters that affect the performance of a query  Buffer space availability in the main memory at the time of execution of the query  Size of input and the size of the output generated 

The size of memory block on the disk and the size in the main memory also affects the performance

309

Parameters for Measuring Costs

      B: The number of blocks are needed to hold all tuples of relation R. Also denoted as B(R) T:The number of tuples in relationR. Also denoted as T(R) V: The number of distinct values that appear in a column of a relation R V(R, a)- is the number of distinct values of column for a in relation R 310

I/O Cost for Scan Operators

 If relation R is clustered, then the number of disk I/O for the table-scan operator is = ~B disk I/O’s 

If relation R is not clustered, then the number of required disk I/O generally is much higher

 A index on a relation R occupies many fewer than B(R) blocks That means a scan of the entire relation R which takes at least B disk I/O’s will require more I/O’s than the entire index 311

Iterators for Implementation of Physical Operators

Many physical operators can be implemented as an Iterator.

Three methods forming the iterator for an operation are:

1. Open( ) :

This method starts the process of getting tuples

It initializes any data structures needed to perform the operation

312

Iterators for Implementation of Physical Operators

2. GetNext( ):

 Returns the next tuple in the result  If there are no more tuples to return, GetNext returns a special value NotFound

3. Close( ) :

 Ends the iteration after all tuples  It calls Close on any arguments of the operator 313

Query Execution One-Pass Algorithms for Database Operations (15.2) Presented by

Ronak Shah (214) April 22, 2009 314

Categorizing Algorithms

By general technique

sorting-based

hash-based

index-based

By the number of times data is read from disk

one-pass

• •

two-pass multi-pass (more than 2)

By what the operators work on

tuple-at-a-time, unary

full-relation, unary

full-relation, binary

One-Pass Algorithm Methods

 Tuple-at-a-time, unary operations: (selection & projection)  Full-relation, unary operations  Full-relation, binary operations (set & bag versions of union) 316

One-Pass, Tuple-at-a-Time

These are for SELECT and PROJECT

Algorithm:

read the blocks of R sequentially into an input buffer

perform the operation

move the selected/projected tuples to an output buffer

Requires only M ≥ 1

I/O cost is that of a scan (either B or T, depending on if R is clustered or not)

Exception! Selecting tuples that satisfy some condition on an indexed attribute can be done faster !

One-Pass Algorithms for Tuple-at-a-Time Operations

 Tuple-at-a-time operations are selection and projection  read the blocks of

R

one at a time into an input buffer  perform the operation on each tuple  move the selected tuples or the projected tuples to the output buffer  The disk I/O requirement for this process depends only on how the argument relation R is provided.  If R is initially on disk, then the cost is whatever it takes to perform a table-scan or index-scan of R. 318

One-Pass, Tuple-at-a-Time

duplicate elimination (DELTA)

Algorithm:

keep a main memory search data structure D (use search tree or hash table) to store one copy of each tuple

• • •

read in each block of R one at a time (use scan) for each tuple check if it appears in D if not then add it to D and to the output buffer

Requires 1 buffer to hold current block of R; remaining M-1 buffers must be able to hold D

I/O cost is just that of the scan

One-Pass, Unary, Full-Relation

duplicate elimination (DELTA)

Algorithm:

keep a main memory search data structure D (use search tree or hash table) to store one copy of each tuple

• • •

read in each block of R one at a time (use scan) for each tuple check if it appears in D if not then add it to D and to the output buffer

Requires 1 buffer to hold current block of R; remaining M-1 buffers must be able to hold D

I/O cost is just that of the scan

A selection or projection being performed on a relation R

321

One-Pass Algorithms for Unary, fill Relation Operations

 Duplicate Elimination  To eliminate duplicates, we can read each block of

R

one at a time, but for each tuple we need to make a decision as to whether: 1.

It is the first time we have seen this tuple, in which case we copy it to the output, or 2.

We have seen the tuple before, in which case we must not output this tuple.

 One memory buffer holds one block of R's tuples, and the remaining M - 1 buffers can be used to hold a single copy of every tuple.

322

Managing memory for a one-pass duplicate-elimination

323

Duplicate Elimination

 When a new tuple from

R

is considered, we compare it with all tuples seen so far  if it is not equal: we copy both to the output and add it to the in-memory list of tuples we have seen.  if there are n tuples in main memory: each new tuple takes processor time proportional to n, so the complete operation takes processor time proportional to n 2 . 

We need a main-memory structure that allows each of the operations:

Add a new tuple, and

Tell whether a given tuple is already there

324

Duplicate Elimination (…contd.)

 The different structures that can be used for such main memory structures are:  Hash table  Balanced binary search tree 325

One-Pass Algorithms for Unary, fill Relation Operations

 Grouping  The grouping operation gives us zero or more grouping attributes and presumably one or more aggregated attributes  If we create in main memory one entry for each group then we can scan the tuples of R, one block at a time.

The entry for a group consists of values for the grouping attributes and an accumulated value or values for each aggregation.

326

Grouping

 The accumulated value is:  For

MIN(a)

or

MAX(a)

/maximum value, respectively.

aggregate, record minimum  For any COUNT aggregation, add 1 for each tuple of group.

 For

SUM(a),

for its group.

add value of attribute a to the accumulated sum 

AVG(a) is a hard case. We must maintain 2 accumulations: count of no. of tuples in the group & sum of a-values of these tuples. Each is computed as we would for a COUNT & SUM aggregation, respectively. After all tuples of R are seen, take quotient of sum & count to obtain average.

327

One-Pass Algorithms for Binary Operations

 Binary operations include:  Union  Intersection  Difference  Product  Join 328

Set Union

 We read S into M - 1 buffers of main memory and build a search structure where the search key is the entire tuple. 

All these tuples are also copied to the output.

Read each block of R into the M

th

buffer, one at a time.

 For each tuple

t

of R, see if

t

is in S, and if not, we copy

t

to the output. If

t

is also in S, we skip

t.

329

Set Intersection

 Read S into M - 1 buffers and build a search structure with full tuples as the search key.  Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output, and if not, ignore t.

330

Set Difference

 Read S into M - 1 buffers and build a search structure with full tuples as the search key.

 To compute R -s

S,

read each block of R and examine each tuple

t

on that block. If t is in S, then ignore

t;

if it is not in S then copy

t

to the output.

To compute S -s R, read the blocks of R and examine each tuple t in turn. If t is in S, then delete t from the copy of S in main memory, while if t is not in S do nothing.

After considering each tuple of R, copy to the output those tuples of S that remain.

331

Bag Intersection

 Read

S

into

M

- 1 buffers.  Multiple copies of a tuple t are not stored individually. Rather store 1 copy of t & associate with it a count equal to no. of times t occurs.  Next, read each block of R, & for each tuple t of R see whether t occurs in

S.

If not ignore

t;

it cannot appear in the intersection. If

t

appears in

S,

& count associated with

t

is (+)ve, then output

t

& decrement count by 1. If t appears in

S,

but count has reached 0, then do not output t; we have already produced as many copies of t in output as there were copies in

S.

332

Bag Difference

 To compute S

-B

R, read tuples of S into main memory & count no. of occurrences of each distinct tuple.  Then read R; check each tuple

t

to see whether t occurs in S, and if so, decrement its associated count. At the end, copy to output each tuple in main memory whose count is positive, & no. of times we copy it equals that count.

 To compute R

-B

S, read tuples of occurrences of distinct tuples.

S

into main memory & count no. of 333

Bag Difference (…contd.)

 Think of a tuple

t

with a count of

c

as

c

reasons not to copy

t

to the output as we read tuples of R. 

Read a tuple t of R; check if t occurs in S. If not, then copy t to the output. If t does occur in S, then we look at current count c associated with t. If c = 0, then copy t to output. If c > 0, do not copy t to output, but decrement c by 1.

334

Product

 Read

S

into

M

- 1 buffers of main memory  Then read each block of R, and for each tuple t of R concatenate

t

with each tuple of

S

in main memory.  Output each concatenated tuple

as

it is formed.  This algorithm may take a considerable amount of processor time per tuple of

R,

because each such tuple must be matched with

M

- 1 blocks full of tuples. However, output size is also large, & time/output tuple is small.

335

Natural Join

 Convention: R(X, Y) is being joined with S(Y, Z), where Y represents all the attributes that R and

S

have in common, X is all attributes of R that are not in the schema of S, & Z is all attributes of S that are not in the schema of R. Assume that S is the smaller relation.  To compute the natural join, do the following: 1.

Read all tuples of S & form them into a main-memory search structure. Hash table or balanced tree are good e.g. of such structures. Use M - 1 blocks of memory for this purpose .

336

Natural Join

1.

Read each block of R into 1 remaining main-memory buffer. For each tuple t of R, find tuples of S that agree with t on all attributes of Y, using the search structure. For each matching tuple of S, form a tuple by joining it with t, & move resulting tuple to output.

337

QUERY EXECUTION 15.3

Nested-Loop Joins

By: Saloni Tamotia (215)

Introduction to Nested-Loop Joins

 Used for relations of any side.

 Not necessary that relation fits in main memory  Uses “One-and-a-half

pass method in which for each variation: – One argument read just once.

– Other argument read repeatedly.

 Two kinds: • Tuple-Based Nested Loop Join • Block-Based Nested Loop Join

ADVANTAGES OF NESTED-LOOP JOIN

Fits in the iterator framework.

Allows us to avoid storing intermediate relation on disk.

Tuple-Based Nested-Loop Join

Simplest variation of the nested-loop join

Loop ranges over individual tuples

Tuple-Based Nested-Loop Join

 Algorithm to compute the Join R(X,Y) | | S(Y,Z)

FOR each tuple s in S DO FOR each tuple r in R DO IF r and s join to make tuple t THEN output t

 R and S are two Relations with r and s as tuples.  carelessness in buffering of blocks causes the use of T(R)T(S) disk I/O’s

IMPROVEMENT & MODIFICATION

To decrease the cost

 Method 1: Use algorithm for Index-Based joins – We find tuple of R that matches given tuple of S – We need not to read entire relation R  Method 2: Use algorithm for Block-Based joins – Tuples of R & S are divided into blocks – Uses enough memory to store blocks in order to reduce the number of disk I/O’s.

• • • • • • • • • • • • • • • • • •

An Iterator for Tuple-Based Nested-Loop Join

Open0 C R.Open() S . Open () GetNextO { REPEAT C r := R.GetNext(); IF (r = NotFound) C /* R is exhausted for the current s */ R.Close(); s := S.GetNext(); IF (s = NotFound) RETURN NotFound; /* both R and S are exhausted */ R.Open0 ; r := R.GetNext(); UNTIL(r and s join) ; RETURN the join of r and s; Close0 ( R. Close () ; S. Close () ;

Block-Based Nested-Loop Join Algorithm

Access to arguments is organized by block.

– While reading tuples of inner relation we use less number of I/O’s disk.

Using enough space in main memory to store tuples of relation of the outer loop.

– Allows to join each tuple of the inner relation with as many tuples as possible.

• • • • • • • • • • • • • FOR each chunk of M-1 blocks of S DO BEGIN read these blocks into main-memory buffers; organize their tuples into a search structure whose search key is the common attributes of R and S; FOR each block b of R DO BEGIN read b into main memory; FOR each tuple t of b DO BEGIN find the tuples of S in main memory that join with t ; output the join of t with each of these tuples; END ;

END ;

END ;

Block-Based Nested-Loop Join Algorithm

ALGORITHM:

FOR each chunk of M -

1

blocks of S DO FOR each block b of R DO FOR each tuple t of b DO find the tuples of S in memory that join with t output the join of t with each of these tuples

Block-Based Nested-Loop Join Algorithm

Assumptions:

– B(S)

B(R) – B(S) > M – This means that the neither relation fits in the entire main memory.

Analysis of Nested-Loop Join

– Number of disk I/O’s: [B(S)/(M 1 )]*(M-1 +B(R)) or B(S) + [B(S)B(R)/(M 1 )] or approximately B(S)*B(R)/M

Two-Pass Algorithms Based on Sorting

SECTION 15.4

Rupinder Singh

Two-Pass Algorithms Based on Sorting

• •

Two-pass Algorithms: data from operand relations is read into main memory, then processed, written out to disk and then re-read from the disk to complete the operation In this section, we consider sorting as tool from implementing relational operations. The basic idea is as follows if we have large relation R, where B(R) is larger than M, the number of memory buffers we have available, then we can repeatedly

Basic idea

• • • Step 1: Read M blocks of R into main memory. Step 2:Sort these M blocks in main memory, using an efficient, main-memory sorting algorithm. so we expect that the time to sort will not exceed the disk 1/0 time for step (1).

Step 3: Write the sorted list into M blocks of disk.

Duplicate Elimination Using Sorting

δ

(R)

• • •

To perform δ (R) operation in two passes, we sort tuples of R in sublists. Then we use available memory to hold one block from each stored sublists and then repeatedly copy one to the output and ignore all tuples identical to it.

The total cost of this algorithm is 3B(R) This algorithm requires only √B(R)blocks of main memory, rather than B(R) blocks(one-pass algorithm).

Example

• • Suppose that tuples are integers, and only two tuples fit on a block. Also, M = 3 and the relation R consists of 17 tuples: 2,5,2,1,2,2,4,5,4,3,4,2,1,5,2,1,3 After first-pass

Sublists

R1 R2 R3

Elements

1,2,2,2,2,5 2,3,4,4,4,5 1,1,2,3,5

Example

• Second pass

Sublist

R1 R2 R3

In memory

1,2 2,3 1,1

Waiting on disk

2,2, 2,5 4,4, 4,5 2,3,5 After processing tuple 1

Sublist

R1

In memory

2 R2 2,3 Output: 1 R3 2,3 Continue the same process with next tuple.

Waiting on disk

2,2, 2,5 4,4, 4,5 5

• • • • •

Grouping and Aggregation Using Sorting

γ

(R)

Two-pass algorithm for grouping and aggregation is quite

similar to the previous algorithm.

Step 1:Read the tuples of R into memory, M blocks at a time. Sort each M blocks, using the grouping attributes of L as the sort key. Write each sorted sublist to disk.

Step 2:Use one main-memory buffer for each sublist, and initially load the first block of each sublist into its buffer.

Step 3:Repeatedly find the least value of the sort key (grouping attributes) present among the first available tuples in the buffers.

This algorithm takes 3B(R) disk 1/0's, and will work as long as B(R) < M².

A Sort-Based Union Algorithm

• • • For bag-union one-pass algorithm is used.

For set-union – Step 1:Repeatedly bring M blocks of R into main memory, sort their tuples, and write the resulting sorted sublist back to disk.

– Step 2:Do the same for S, to create sorted sublists for relation S.Step 3:Use one main-memory buffer for each sublist of R and S. Initialize each with the first block from the corresponding sublist.

– Step 4:Repeatedly find the first remaining tuple t among all the

buffers. Copy

t to the output. and remove from the buffers all copies of t

(if R and S are sets there should be at most two copies) This algorithm takes 3(B(R)+B(S)) disk 1/0's, and will work as long as

B(R)+B(S) < M².

• • • • •

Sort-Based Intersection and Difference

For both set version and bag version, the algorithm is same as that of set-union except that the way we handle the copies of a tuple t at the fronts of the sorted sublists.

For set intersection, output t if it appears in both R and

S.

For bag intersection, output t the minimum of the number of times it appears in R and in S.

For set difference, R-S, output t if and only if it appears in R but not in S.

For bag difference, R-S, output t the number of times it appears in R minus the number of times it appears in S.

A Simple Sort-Based Join Algorithm

Given relation R(x,y) and S(y,z) to join, and given M blocks of main memory for buffers, 1. Sort R, using a two phase, multiway merge sort, with y as the sort key.

2. Sort S similarly 3. Merge the sorted R and S. Generally we use only two buffers, one for the current block of R and the other for current block of S. The following steps are done repeatedly.

a. Find least value y of the join attributes Y that is currently at the front of the blocks for R and S.

b. If y doesn ’ t appear at the front of the other relation, then remove the tuples with sort key y.

c. Otherwise identify all the tuples from both relation having sort key y d. Output all tuples that can be formed by joining tuples from R and S with a common Y value y.

e. If either relation has no more unconsidered tuples in main memory reload buffer for the relation.

A More Efficient Sort-Based Join

• • If we do not have to worry about very large numbers of tuples with a common value for the join attribute(s), then we can save two disk 1/0's per block by combining the second phase of the sorts with the join itself To compute R(X, Y) ►◄ S(Y, Z) using M main-memory buffers – Create sorted sublists of size M, using Y as the sort key, for both

R and S.

– – Bring the first block of each sublist into a buffer Repeatedly find the least Y-value y among the first available tuples of all the sublists. Identify all the tuples of both relations that have Y-value y. Output the join of all tuples from R with all tuples from S that share this common Y-value

A More Efficient Sort-Based Join

• • The number of disk I/O’s is 3(B(R) + B(S)) It requires B(R) + B(S) ≤ M² to work

Summary of Sort-Based Algorithms Operators

γ,δ U, ∩,− ►◄ ►◄(more efficient)

Approximate M required

√B √(B(R) + B(S)) √(max(B(R),B(S))) √(B(R) + B(S))

Disk I/O

3B 3(B(R) + B(S)) 5(B(R) + B(S)) 3(B(R) + B(S))

Query Execution

15.5 Two-pass Algorithms based on Hashing

By Swathi Vegesna

At a glimpse

• • • • • • • • Introduction Partitioning Relations by Hashing Algorithm for Duplicate Elimination Grouping and Aggregation Union, Intersection, and Difference Hash-Join Algorithm Sort based Vs Hash based Summary

Introduction

Hashing is done if the data is too big to store in main memory buffers. – Hash all the tuples of the argument(s) using an appropriate hash key. – For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation have the same hash value.

– This reduces the size of the operand(s) by a factor equal to the number of buckets.

Partitioning Relations by Hashing

Algorithm: initialize M-1 buckets using M-1 empty buffers; FOR each block b of relation R DO BEGIN read block b into the Mth buffer; FOR each tuple t in b DO BEGIN IF the buffer for bucket h(t) has no room for t THEN BEGIN copy the buffer t o disk; initialize a new empty block in that buffer; END; copy t to the buffer for bucket h(t); END ; END ; FOR each bucket DO IF the buffer for this bucket is not empty THEN write the buffer to disk;

Duplicate Elimination

For the operation δ(R) hash R to M-1 Buckets.

• • (Note that two copies of the same tuple t will hash to the same bucket) Do duplicate elimination on each bucket R using one-pass algorithm

i

independently, The result is the union of δ(R that hashes to the ith bucket

i

), where R

i

is the portion of R

Requirements

• Number of disk I/O's: 3*B(R) • – B(R) < M(M-1), only then the two-pass, hash-based algorithm will work In order for this to work, we need: – hash function h evenly distributes the tuples among the buckets – each bucket R

i

fits in main memory (to allow the one-pass algorithm) – i.e., B(R) ≤ M

2

Grouping and Aggregation

 Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes (Note: all tuples in the same group end up in the same bucket)   Use the one-pass algorithm to process each bucket independently Uses 3*B(R) disk I/O's, requires B(R) ≤ M

2

Union, Intersection, and Difference

• • • • • • For binary operation we use the same hash function to hash tuples of both arguments.

R U S we hash both R and S to M-1 R

S we hash both R and S to 2(M-1) R-S we hash both R and S to 2(M-1) Requires 3(B(R)+B(S)) disk I/O’s.

Two pass hash based algorithm requires min(B(R)+B(S))≤ M2

Hash-Join Algorithm

• Use same hash function for both relations; hash function should depend only on the join attributes • • • • Hash R to M-1 buckets R

1 , R 2 , …, R M-1

Hash S to M-1 buckets S

1

, S

2

, …, S

M-1

Do one-pass join of R

i

and S

i

, for all i 3*(B(R) + B(S)) disk I/O's; min(B(R),B(S)) ≤ M

2

Sort based Vs Hash based

• For binary operations, hash-based only limits size to min of arguments, not sum • Sort-based can produce output in sorted order, which can be helpful • Hash-based depends on buckets being of equal size • Sort-based algorithms can experience reduced rotational latency or seek time

Index-Based Algorithms

Chapter 15 Section 15.6

Presented by Fan Yang CS 257 Class ID218

371

Clustering and Nonclustering Indexes

A relation is “ clustered ” if its tuples are packed into roughly as few blocks as can possibly hold those tuples.

Clustering Indexes, which are indexes on an attribute or attributes such that all the tuples with a fixed value for the search key of this index appear on roughly as few blocks as can hold them. Note that a relation that isn't clustered cannot have a clustering index, but even a clustered relation can have nonclustering indexes.

A clustering index has all tuples with a fixed value packed into the minimum possible number of blocks

372

Clustering and Nonclustering Indexes

• A relation that isn’t clustered cannot have a clustering index • A clustered relation can have nonclustering indexes 373

Index-Based Selection

• • •

Selection on equality:

a=v (R) Clustered index on a: cost B(R)/V(R,a)

If the index on R.a is clustering, then the number of disk I/O's to retrieve the set

a = v (R) will average B(R)/V(R, a). The actual number may be somewhat higher.

Unclustered index on a: cost T(R)/V(R,a)

If the index on R.a is nonclustering, each tuple we retrieve will be on a different block, and we must access T(R)/V(R,a) tuples. Thus, T(R)/V(R, a) is an estimate of the number of disk I/O’s we need.

374

Index-Based Selection

• The actual number may be higher: 1. index is not kept entirely in main memory 2. they spread over more blocks 3. may not be packed as tightly as possible into blocks 375

Example

• • • • • B(R)=1000, T(R)=20,000 number of I/O’s required: 1. clustered, not index 2. not clustered, not index 1000 20,000 3. If V(R,a)=100, index is clustering 10 4. If V(R,a)=10, index is nonclustering 2,000 376

Joining by Using an Index

• Natural join R(X, Y) S S(Y, Z) Number of I/O’s to get R Clustered: B(R) Not clustered: T(R) Number of I/O’s to get tuple t of S Clustered: T(R)B(S)/V(S,Y) Not clustered: T(R)T(S)/V(S,Y) 377

Example

• R(X,Y): 1000 blocks S(Y,Z)=500 blocks Assume 10 tuples in each block, so T(R)=10,000 and T(S)=5000 V(S,Y)=100 If R is clustered, and there is a clustering index on Y for S the number of I/O’s for R is: 1000 the number of I/O’s for S is10,000*500/100=50,000 378

Joins Using a Sorted Index

• • • Natural join R(X, Y) S (Y, Z) with index on Y for either R or S Extreme case: Zig-zag join Example: relation R(X,Y) and R(Y,Z) with index on Y for both relations search keys (Y-value) for R: 1,3,4,4,5,6 search keys (Y-value) for S: 2,2,4,6,7,8 379

Chapter 15.7

Buffer Management

ID: 219 Name: Qun Yu Class: CS257 219 Spring 2009 Instructor: Dr. T.Y.Lin

What does a buffer manager do?

Central Task of making memory buffers available to processors is done with the help of buffer managers.

In practice: 1) rarely allocated in advance 2) the value of

M

may vary depending on system conditions Therefore,

buffer manager

is used to allow processes to get the memory they need, while minimizing the delay and unclassifiable requests.

The role of the buffer manager Read/Writes Requests Buffers

Buffer manager

Figure 1:

The role of the buffer manager : responds to requests for main-memory access to disk blocks

15.7.1 Buffer Management Architecture Two broad architectures for a buffer manager : 1) The buffer manager which controls main memory directly is Relational DBMS 2) The buffer manager allocates buffers in virtual memory, allowing the OS to decide how to use buffers. i.e

“main-memory” DBMS • “object-oriented” DBMS

Buffer Pool Key setting for the Buffer manager to be efficient: Problem: The buffer manager should limit

the number of buffers

in use so that they fit in the available main memory, i.e. Don ’t exceed available space.

The number of buffers is a parameter set when the DBMS is initialized. No matter which architecture of buffering is used, we simply assume that there is a fixed-size

buffer pool

, a set of buffers available to queries and other database actions.

Buffer Pool Page Requests from Higher Levels BUFFER POOL disk page free frame MAIN MEMORY DISK DB choice of frame dictated by replacement policy • •

Data must be in RAM for DBMS to operate on it!

Buffer Manager hides the fact that not all data is in RAM.

15.7.2 Buffer Management Strategies

Buffer-replacement strategies:

Critical choice the buffer manager has to make is when a buffer is needed for a newly requested block and the buffer pool is full then which block to throw out the buffer pool.

Buffer-replacement strategy -- LRU

Least-Recently Used (LRU):

To throw out the block that

has not been read or written

for the longest time. • Requires more maintenance but it is effective. • Update the time table for every access.

• Least-Recently Used blocks are usually less likely to be accessed sooner than other blocks .

Buffer-replacement strategy -- FIFO

First-In-First-Out (FIFO):

The buffer that

has been occupied

the longest

by the same block

is emptied and used for the new block.

• Requires less maintenance but it can make more mistakes.

• Keep only the loading time • The oldest block doesn’t mean it is less likely to be accessed.

Example: the root block of a B-tree index

Buffer-replacement strategy – “Clock”

The “Clock” Algorithm (“Second Chance”)

Think of the 8 buffers as arranged in a circle, shown as

Figure 3

Flag 0 and 1:  buffers with a 0 flag are ok to sent their contents back to disk, i.e. ok to be replaced  buffers with a 1 flag are not ok to be replaced

Buffer-replacement strategy – “Clock” 0 0 1 the buffer with a 0 flag will be replaced 0 0 0 1 1 Start point to search a 0 flag The flag will be set to 0 By next time the hand reaches it, if the content of this buffer is not accessed, i.e. flag=0, this buffer will be replaced.

That ’s “Second Chance”.

Figure 3: the clock algorithm

Buffer-replacement strategy -- Clock a buffer’s flag set to 1 when:  a block is read into a buffer  the contents of the buffer is accessed a buffer’s flag set to 0 when:  the buffer manager needs a buffer for a new block, it looks for the first 0 it can find, rotating clockwise. If it passes 1’s, it sets them to 0.

System Control helps Buffer-replacement strategy System Control The query processor or other components of a DBMS can give advice to the buffer manager in order to avoid some of the mistakes that would occur with a strict policy such as LRU,FIFO or Clock.

For example:

A “pinned” block means it can’t be moved to disk without first modifying certain other blocks that point to it. In FIFO, use “pinned” to force root of a B-tree to remain in memory at all times.

15.7.3 The Relationship Between Physical Operator Selection and Buffer Management Problem:  Physical Operator expected certain number of buffers M for execution.

 However, the buffer manager may not be able to guarantee these M buffers are available.

15.7.3 The Relationship Between Physical Operator Selection and Buffer Management Questions:  Can the algorithm adapt to changes of

M

, the number of main-memory buffers available ?

 When available buffers are less than

M

, and some blocks have to be put in disk instead of in memory. How the buffer-replacement strategy impact the performance (i.e. the number of additional I/O ’s)?

Example FOR

each chunk of M-1 blocks of S

DO BEGIN read these blocks into main-memory buffers; organize their tuples into a search structure whose search key is the common attributes of R and S; FOR

each block b of R

DO BEGIN read b into main memory; FOR each tuple t of b DO BEGIN find the tuples of S in main memory that join with t ; output the join of t with each of these tuples; END ; END ; END ; Figure 15.8: The nested-loop join algorithm

Example  The outer loop number (M-1) depends on the average number of buffers are available at each iteration.

 The outer loop use M-1 buffers and 1 is reserved for a block of R, the relation of the inner loop.

 If we

pin

the M-1 blocks we use for S on

one iteration

of the outer loop, we shall not lose their buffers during the round. Also, more buffers may become available and then we could keep more than one block of R in memory.

Will these extra buffers improve the running time?

Example CASE1: NO  Buffer-replacement strategy: LRU  Buffers for R: k  We read each block of R in order into buffers.

 By end of the iteration of the outer loop, the last k blocks of R are in buffers.

 However, next iteration will start from the beginning of R again.  Therefore, the k buffers for R will need to be replaced.

Example CASE 2: YES  Buffer-replacement strategy: LRU  Buffers for R: k  We read the blocks of R in an order that alternates:

first

last and then last

first.

 In this way, we save k disk I/Os on each iteration of the outer loop except the first iteration.

Other Algorithms and M buffers Other Algorithms also are impact by M and the buffer-replacement strategy.

 Sort-based algorithm If we use a sort-based algorithm for some operator, then it is possible to adapt to changes in M. If Af shrinks, we can change the size of a sublist, since the sort-based algorithms we discussed do not depend on the sublists being the same size. The major limitation is that as M shrinks, we could be forced to create so many sublists that we cannot then allocate a buffer for each sublist in the merging process.

.

• Hash Table • • • • • • • If the algorithm is hash-based, ive can reduce the number of buckets if shrinks, as long as the buckets do not then become so large that they do not fit in allotted main memory. However, unlike sort-based algorithms, we cannot respond to changes in A1 while the algorithm runs. Rather, once the number of buckets is chosen, it remains fixed throughout the first pass, and if buffers become unavailable, the blocks belonging to some of the buckets will have to be ST\-appedo ut .

Chapter 15 Query Execution 15.8 Algorithms using more than two passes

Presented by: Kai Zhu Professor: Dr. T.Y. Lin Class ID: 220

Reason that we use more than two passes: Two passes are usually enough, however, for the largest relation, we use as many passes as necessary.

• •

Multi-pass Sort-based Algorithms

Suppose we have M main-memory buffers available to sort a relation R, which we assume is stored clustered.

• Then we do the following: If R fits in M blocks (i.e., B(R)<=M) 1. Read R into main memory.

2. Sort it using any main-memory sorting algorithm.

3. Write the sorted relation to disk.

INDUCTION:

If R does not fit into main memory.

1. Partition the blocks holding R into groups, which we shall call R1, R2, R3 … 2. Recursively sort Ri for each i=1,2,3 … M.

M 3. Merge the M sorted sublists.

If we are not merely sorting R, but performing a unary operation such as δ or γ on R. We can modify the above so that at the final merge we perform the operation on the tuples at the front of the sorted sublists.

That is:

• • For a δ, output one copy of each distinct tuple, and skip over copies of the tuple.

For a γ, sort on the grouping attributes only, and combine the tuples with a given value of these grouping attributes.

Performance of Multipass, Sort-Based Algorithms

• • • • • • BASIS: If k = 1, i.e., one pass is allowed, then we must have B(R) < M. Put another way, s(M, 1) = Af.

INDUCTION: Suppose k > 1. Then we partition R into 1M pieces, each of which must be sortable in k - 1 passes. If B(R) = s(M, k), then s(M, k)/:l17 which is the size of each of the M pieces of R, cannot exceed s(M, k - 1). That is: s(M, k) = Ms(M, k - 1)

Multipass Hash-Based Algorithms

• • BASIS: For a unary operation, if the relation fits in hl buffers, read it into memory and perfor111 the operation. For a binary operation, if either relation fits in ,11 - I buffers, perform the operation by reading this relation into main memory and then read the second relation, one block at a time, into the Mth buffer.

• • INDUCTION: If no relation fits in main memory, then hash each relation into A 1 -1 buckets, as discussed in Section 15.5.1. Recursively perform the operation on each bucket or corresponding pair of buckets, and accumulate the output from each bucket or pair.

Chapter 15 Query Execution 15.8 Algorithms using more than two passes

Presented by: Kai Zhu Professor: Dr. T.Y. Lin Class ID: 220

Reason that we use more than two passes: Two passes are usually enough, however, for the largest relation, we use as many passes as necessary.

• •

Multi-pass Sort-based Algorithms

Suppose we have M main-memory buffers available to sort a relation R, which we assume is stored clustered.

• Then we do the following: If R fits in M blocks (i.e., B(R)<=M) 1. Read R into main memory.

2. Sort it using any main-memory sorting algorithm.

3. Write the sorted relation to disk.

INDUCTION:

If R does not fit into main memory.

1. Partition the blocks holding R into groups, which we shall call R1, R2, R3 … 2. Recursively sort Ri for each i=1,2,3 … M.

M 3. Merge the M sorted sublists.

If we are not merely sorting R, but performing a unary operation such as δ or γ on R. • • We can modify the above so that at the final merge we perform the operation on the tuples at the front of the sorted sublists.

That is: For a δ, output one copy of each distinct tuple, and skip over copies of the tuple.

For a γ, sort on the grouping attributes only, and combine the tuples with a given value of these grouping attributes.

Conclusion The two pass algorithms based on sorting or hashing have natural recursive analogs that take three or more passes and will work for larger amounts of data.

Performance of Multipass, Sort-Based Algorithms

• • • • • • BASIS: If k = 1, i.e., one pass is allowed, then we must have B(R) < M. Put another way, s(M, 1) = Af.

INDUCTION: Suppose k > 1. Then we partition R into 1M pieces, each of which must be sortable in k - 1 passes. If B(R) = s(M, k), then s(M, k)/:l17 which is the size of each of the M pieces of R, cannot exceed s(M, k - 1). That is: s(M, k) = Ms(M, k - 1)

Multipass Hash-Based Algorithms

• • BASIS: For a unary operation, if the relation fits in hl buffers, read it into memory and perfor111 the operation. For a binary operation, if either relation fits in ,11 - I buffers, perform the operation by reading this relation into main memory and then read the second relation, one block at a time, into the Mth buffer.

• • INDUCTION: If no relation fits in main memory, then hash each relation into A 1 -1 buckets, as discussed in Section 15.5.1. Recursively perform the operation on each bucket or corresponding pair of buckets, and accumulate the output from each bucket or pair.