Sections 13.1 – 13.3 - San Jose State University

Download Report

Transcript Sections 13.1 – 13.3 - San Jose State University

Cs257 Presentations Summary
By
Tharun Bandaru
ID : 122
SECONDARY STORAGE
MANAGEMENT
Sections 13.1 – 13.3
13.1.1 Memory Hierarchy
• Storage capacity for a data depends on the type of the data and hence
varies for data to data.
• Cost per byte to store data also varies
• Device with smallest capacity offer the fastest
speed with highest cost per bit
Memory Hierarchy Diagram
Programs,
DBMS
Main Memory DBMS’s
As Visual Memory
Tertiary Storage
Disk
Main Memory
Cache
File System
13.1.1 Memory Hierarchy
• Cache
– Lowest level of the hierarchy
– Data items are copies of certain locations of main memory
– Sometimes, values in cache are changed and
corresponding changes to main memory are delayed
– Machine looks for instructions as well as data for those
instructions in the cache
– Holds limited amount of data
– In the case of a single processor computer, there is no need to update
the data into the main memory immediately.
– In multiple processors data is updated immediately to main
memory….called as write through
Main Memory
• Refers to physical memory that is internal to the computer. The word main
is used to distinguish it from external mass storage devices such as disk
drives.
• Everything happens in the computer i.e.
instruction execution, data manipulation, as
working on information that is resident in
main memory
• Main memories are random access….one can
obtain any byte in the same amount of time
Secondary storage
• Used to store data and programs when they
are not being processed
• More permanent than main memory, as data
and programs are retained when the power is
turned off
• A personal computer might only require 20,000 bytes of secondary
storage
• E.g. magnetic disks, hard disks
Tertiary Storage
• consists of anywhere from one to several storage drives.
• It is a comprehensive computer storage system that is usually very slow,
so it is usually used to archive data that is not accessed frequently.
• Holds data volumes in terabytes
• Used for databases much larger than what can
be stored on disk
13.1.2 Transfer of Data Between levels
• Data moves between adjacent levels of the hierarchy
•
At the secondary or tertiary levels accessing the desired data or finding the desired
place to store the data takes a lot of time
• Disk is organized into bocks
• Entire blocks are moved to and from memory called a
buffer
•
A key technique for speeding up database operations is to arrange the data so that
when one piece of data block is needed it is likely that other data on the same block will
be needed at the same time
• Same idea applies to other hierarchy levels
13.1.3 Volatile and Non Volatile
Storage
• A volatile device forgets what data is stored on
it after power off
• Non volatile holds data for longer period even
when device is turned off
• Secondary and tertiary devices are non volatile
• Main memory is volatile
13.1.4 Virtual Memory
•
•
computer system technique which gives an application program the impression that it
has contiguous working memory (an address space), while in fact it may be physically
fragmented and may even overflow on to disk storage
technique make programming of large applications easier and use real physical
memory (e.g. RAM) more efficiently
• Typical software executes in virtual memory
• Address space is typically 32 bit or 232 bytes or 4GB
• Transfer between memory and disk is in terms of
blocks
13.2.1 Mechanism of Disk
• Mechanisms of Disks
– Use of secondary storage is one of the important
characteristic of DBMS
– Consists of 2 moving pieces of a disk
• 1. disk assembly
• 2. head assembly
– Disk assembly consists of 1 or more platters
– Platters rotate around a central spindle
– Bits are actually stored in the upper and lower side of the platters.
13.2.1 Mechanism of Disk
• Disk is organized into tracks
• Tracks that are at fixed radius from the center
forms a cylinder.
• Tracks are organized into sectors
• Tracks are the segments of circle separated by
gap
13.2.2 Disk Controller
• Disk Controllers can control one of more
number of disks.
• Disks controllers are capable of
– Selecting the sector from among all those in the
cylinder at which heads are positioned
– Transferring bits between desired sector and main
memory
– Possible buffering an entire track.
– Controls the mechanical actuator whose job is to
move the head assembly.
13.2.3 Disk Access Characteristics
• Accessing (reading/writing) a block requires 3
steps
– Disk controller positions the head assembly at the
cylinder containing the track on which the block is
located. It is a ‘seek time’
– The disk controller waits while the first sector of
the block moves under the head. This is a
‘rotational latency’
– The ‘transfer time’ relates to the sectors in which the disk controller
reads or write the data into the gaps that occur when all the sectors
pass the head.
13.3 Accelerating Access to Secondary
Storage
Several approaches for more-efficiently accessing
data in secondary storage:
Place blocks that are together in the same cylinder.
Divide the data among multiple disks.
Mirror disks.
Use disk-scheduling algorithms.
Prefetch blocks into main memory.
Scheduling Latency – added delay in accessing
data caused by a disk scheduling algorithm.
Throughput – the number of disk accesses per
second that the system can accommodate.
13.3.1 The I/O Model of Computation
The number of block accesses (Disk I/O’s) is a
good time approximation for the algorithm.
Ex 13.3: You want to have an index on R to
identify the block on which the desired tuple
appears, but not where on the block it
resides.
For Megatron 747 (M747) example, it takes 11ms
to read a 16k block.
 The incurred delay in searching for the desired tuple is almost
negligible.
13.3.2 Organizing Data by Cylinders
 Here, first ‘seek time’ and the first ‘rotational
latency’ can never be neglected.
Ex 13.4: We request 1024 blocks of M747.
If data is randomly distributed, average latency is
10.76ms by Ex 13.2, making total latency 11s.
If all blocks are consecutively stored on 1 cylinder:
6.46ms + 8.33ms * 16 = 139ms
(1 average seek)
(time per rotation)
(# rotations)
13.3.3 Using Multiple Disks
 The number of disks are proportional to the factor of the ‘performance’.
Striping – distributing a relation across multiple
disks following this pattern:
Data on disk R1: R1, R1+n, R1+2n,…
Data on disk R2: R2, R2+n, R2+2n,…
…
Data on disk Rn: Rn, Rn+n, Rn+2n, …
 Ex 13.5: We request 1024 blocks with n = 4.
6.46ms + (8.33ms * (16/4)) = 39.8ms
(1 average seek) (time per rotation) (# rotations)
13.3.4 Mirroring Disks
Mirroring Disks – having 2 or more disks hold
Benefit 1: If n disks are mirrors of each other,
the system can survive a crash by n-1 disks.
Benefit 2: If we have n disks, read
performance increases by a factor of n.
 Increase in performance => Increase in efficiency.
13.3.5 Disk Scheduling and the
Elevator Problem
Disk controller will run this algorithm to select
which of several requests to process first.
Pseudo code:
requests[] // array of all non-processed data
requests
upon receiving new data request:
requests[].add(new request)
while(requests[] is not empty)
move head to next location
if(head reaches end)
 reverse head direction
13.3.5 Disk Scheduling and the
Elevator Problem (con’t)
Events:
Head starting point
Request data at 8000
Request data at 24000
Request data at 56000
Get data at 8000
Request data at 16000
Get data at 24000
Request data at 64000
Get data at 56000
Request Data at 40000
Get data at 64000
Get data at 40000
Get data at 16000
64000
56000
48000
40000
32000
24000
16000
8000
Current time
13.6
26.9
34.2
45.5
56.8
4.3
10
20
30
0
data
time
8000..
4.3
24000..
13.6
56000..
26.9
64000..
34.2
40000..
45.5
16000..
56.8
13.3.5 Disk Scheduling and the
Elevator Problem (con’t)
Elevator
Algorithm
data
time
FIFO
Algorithm
data
time
8000..
4.3
8000..
4.3
24000..
13.6
24000..
13.6
56000..
26.9
56000..
26.9
64000..
34.2
16000..
42.2
40000..
45.5
64000..
59.5
16000..
56.8
40000..
70.8
13.3.6 Pre-fetching and Large-Scale
Buffering
If at the application level, we can predict the
order blocks will be requested, we can load
them into main memory before they are
needed.
 By doing so , the cost can be reduced at the same time.
13.4.Disk Failures
• Intermittent Error: Read or write is unsuccessful.
If we try to read the sector but the correct content of that
sector is not delivered to the disk controller. Check for the
good or bad sector. To check write is correct: Read is
performed.
The read operation knows whether a sector is a good or a bad
one.
• Checksums: Each sector has some additional bits, called the
checksums. They are set on the depending on the values of
the data bits stored in that sector. Probability of reading bad
sector is less if we use checksums. For Odd parity: Odd
number of 1’s, add a parity bit 1. For Even parity: Even
number of 1’s, add a parity bit 0. So, number of 1’s becomes
always even.
• Stable -Storage Writing Policy:
To recover the disk failure known as Media Decay, in which if
we overwrite a file, the new data is not read correctly. Sectors
are paired and each pair is said to be X, having left and right
copies as Xl and Xr respectively and check the parity bit of left
and right by substituting spare sector of Xl and Xr until the
good value is returned.
• The term used for these strategies is RAID or Redundant
Arrays of Independent Disks.
• Mirroring:
Mirroring Scheme is referred as RAID level 1 protection
against data loss scheme. In this scheme we mirror each disk.
One of the disk is called as data disk and other redundant
disk.
In this technique, if there is a data loss, then the only possible
reason could be the occurrence of a second crash while
repairing the first crash.
• Parity Blocks:
RAID level 4 scheme uses only one redundant disk no matter
how many data disks there are. In the redundant disk, the ith
block consists of the parity checks for the ith blocks of all the
data disks. It means, the jth bits of all the ith blocks of both
data disks and redundant disks, must have an even number of
1’s and redundant disk bit is used to make this condition true.
• Failures: If out of Xl and Xr, one fails, it can be read form other, but
in case both fails X is not readable, and its probability is very small
• Write Failure: During power outage,
•
1. While writing Xl, the Xr, will remain good and X can be read
from Xr
•
2. After writing Xl, we can read X from Xl, as Xr may or may not
have the correct copy of X.
Recovery from Disk Crashes:
• To reduce the data loss by Dish crashes, schemes which involve
redundancy, extending the idea of parity checks or duplicate sectors
can be applied.
• Parity Block – Writing
When we write a new block of a data disk, we need to change that
block of the redundant disk as well.
• One approach to do this is to read all the disks and compute the
module-2 sum and write to the redundant disk.
Total number of operation:
‘n-1’ operations to ‘read’
1 operation to write a data block
1 operation to write a ‘redundant disk block’
So , total = (n+1) disk IOs
RAID 5
RAID 4 is effective in preserving data unless there are two
simultaneous disk crashes.
Whatever scheme we use for updating the disks, we need to
read and write the redundant disk's block. If there are n data
disks, then the number of disk writes to the redundant disk
will be n times the average number of writes to any one data
disk.
However we do not have to treat one disk as the redundant
disk and the others as data disks. Rather, we could treat each
disk as the redundant disk for some of the blocks. This
improvement is often called RAID level 5.
13.5 Arranging data on disk
• Data elements are represented as records, which stores in
consecutive bytes in same disk block.
Basic layout techniques of storing data :
Fixed-Length Records
Allocation criteria - data should start at word boundary.
Fixed Length record header
1. A pointer to record schema.
2. The length of the record.
3. Timestamps to indicate last modified or last read.
Example
CREATE TABLE employee(
name CHAR(30) PRIMARY KEY,
address VARCHAR(255),
gender CHAR(1),
birthdate DATE
);
Data should start at word boundary and contain header and four
fields name, address, gender and birthdate.
• Packing Fixed-Length Records into Blocks
Initially, the records are stored in the form of blocks on the disk.
When we need to update them, they are moved into main
memory.
A block header is written first, and it is followed by series of
blocks.
Block header contains the following information:
Links to one or more blocks that are part of a network of
blocks.
Information about the role played by this block in such a
network.
Information about the relation, the tuples in this block belong
to.
A "directory" giving the offset of each record in the block.
Time stamp(s) to indicate time of the block's last modification
and/or access
Along with the header we can pack as many record as we can
Along with the header we can pack as many record as we can
in one block as shown in the figure and remaining space will
be unused.
13.6 Representing Block and Record Addresses
• Address of a block and Record
– In Main Memory
• Address of the block is the virtual memory address of
the first byte
• Address of the record within the block is the virtual
memory address of the first byte of the record
– In Secondary Memory: sequence of bytes describe the
location of the block in the overall system
• Sequence of Bytes describe the location of the block : the
device Id for the disk, Cylinder number, etc.
• Addresses in Client-Server Systems
• The addresses in address space are represented in two ways
– Physical Addresses: byte strings that determine the place
within the secondary storage system where the record can
be found.
– Logical Addresses: arbitrary string of bytes of some fixed
length
• Physical Address bits are used to indicate:
– Number of the tracks
– Host to which the storage is attached
– Identifier for the disk
– Number of the cylinder
– Offset of the beginning of the record
• Map Table relates logical addresses to physical addresses.
Logical
Physical
Logical Address
Physical Address
• Logical and Structured Addresses
Purpose of logical address?
Gives more flexibility, when we
– Move the record around within the block
– Move the record to another block
Gives us an option of deciding what to do when a record is
deleted?
• Pointer Swizzling
Having pointers is common in an object-relational database
systems
Important to learn about the management of pointers
Every data item (block, record, etc.) is characterized by two
addresses:
– i) database address : address on the disk
– Ii) memory address, if the data item is in virtual memory
• Translation Table: Maps database address to
memory address
Dbaddr
Mem-addr
Database address
Memory Address
• All addressable items in the database have entries
in the map table, while only those items currently
in memory are mentioned in the translation table
• Pointer consists of the following two fields
– Bit indicating the type of address
– Database or memory address
– Example 13.17
Disk
Memory
Swizzled
Block 1
Block 1
Unswizzled
Block 2
• Example 13.7
Block 1 has a record with pointers to a second record on the
same block and to a record on another block
If Block 1 is copied to the memory
– The first pointer which points within Block 1 can be
swizzled so it points directly to the memory address of the
target record
– Since Block 2 is not in memory, we cannot swizzle the
second pointer
• Three types of swizzling
– Automatic Swizzling
– Swizzling on Demand
• Only swizzle a pointer if and when it is actually
followed.
– No Swizzling
• Pointers are not swizzled they are accesses using the
database address.
• Unswizzling
– When a block is moved from memory back to disk, all
pointers must go back to database (disk) addresses
– Use translation table again
– Here, it is always important to have an efficient data
structure for the translation table
• Pinned records and Blocks
• A block in memory is said to be pinned if it cannot be written
back to disk safely.
• If block B1 has swizzled pointer to an item in block B2, then B2
is pinned
– Unpin a block, we must unswizzle any pointers to it
– Keep in the translation table the places in memory holding
swizzled pointers to that item
– Unswizzle those pointers by using the translation table to
replace the memory addresses with database (disk)
addresses.
13.7 Records With Variable-Length Fields
A simple but effective scheme is to put all fixed length
fields ahead of the variable-length fields. We then place
in the record header:
1. The length of the record.
2. Pointers to (i.e., offsets of) the beginnings of all the variablelength fields. However, if the variable-length fields always
appear in the same order then the first of them needs no
pointer; we know it immediately follows the fixed-length
fields.
• Records With Repeating Fields
• A similar situation occurs if a record contains a variable
number of Occurrences of a field F, but the field itself is of
fixed length. It is sufficient to group all occurrences of field F
together and put in the record header a pointer to the first.
• We can locate all the occurrences of the field F as follows.
Let the number of bytes devoted to one instance of field F be
L. Now, we add to the offset for the Field F, all integer multiples
of L, starting at 0, then L, 2L, 3L, and so on.
• Eventually, we reach the offset of the field following F.
Where upon we stop.
An alternative representation is to keep the record of fixed
length, and put the variable length portion - be it fields of
variable length or fields that repeat an indefinite number of
times - on a separate block. In the record itself we keep:
• 1. Pointers to the place where each repeating field begins,
and
• 2. Pointers that point to the end o the repetition.
• Variable-Format Records
• The simplest representation of variable-format records is a
sequence of tagged fields, each of which consists of:
1. Information about the role of this field, such as:
(a) The attribute or field name,
(b) The type of the field, if it is not apparent from the field
name and some readily available schema information, and
(c) The length identifier, in a more specific way.
2. The value of the field.
There are at least two reasons why tagged fields would make
sense.
1. Applications of Information Integration : - Sometimes, a
relation will be constructed from several earlier sources, and
these sources have different kinds of information.
2. Records with a fully flexible schema - For example, medical
records may contain information about many tests, but there
are thousands of possible tests, and each patient has results
for relatively few of them
• These large values have a variable length, but even if the length is
fixed for all values of the type, we need to use some special
techniques to represent these values. In this section we shall
consider a technique called “spanned records" that can be used to
manage records that are larger than blocks.
• Spanned records also are useful in situations where records are
smaller than blocks, but packing whole records into blocks wastes
significant amounts of space.
For both these reasons, it is sometimes desirable to allow records
to be split across two or more blocks. The portion of a record that
appears in one block is called a record fragment.
The constraint for a record to be spanned is that the record
fragment requires some extra header information.
1. Each record or fragment header must contain a bit telling
whether or not it is a fragment.
2. If it is a fragment, then it needs bits telling whether it is the
first or last fragment for its record.
3. Fragments of records, if any, point to the other fragments of
the other records.
Storing spanned records across blocks:
• BLOBS
• Binary, Large OBjectS = BLOBS
• BLOBS can be images, movies, audio files and other very large
values that can be stored in files.
• Storing BLOBS
– Stored in several blocks.
– It is always recommended to store the blobs consecutively to
obtain efficiency while retrieving.
• Retrieving BLOBS
– A client retrieving a 2 hour movie may not want it all at the
same time.
– Retrieving a specific part of the large data requires an index
structure to make it efficient. (Example: An index by seconds
on a movie BLOB.)
• Column Stores:
• An alternative to storing tuples as records is to store each
column as a record. Since an entire column of a relation may
occupy far more than a single block, these records may span
many block, much as long as files do. If we keep the values in
each column in the same order then we can reconstruct the
relation from column records
13.8
• Insertion:
 Insertion of records without order
Records can be placed in a block with empty space or in a
new block.
Insertion of records in fixed order
– Space available in the block
– No space available in the block (outside the block)
Structured address
Pointer to a record from outside the block.
• Insertion in fixed order
Space available within the block
Can be defined as ‘Usage of an offset table in the header of each
block, with pointers to the location of each record in the
block’.
• The records are slid within the block and the pointers in the
offset table are adjusted.
No space available within the block (outside the block)
 Find space on a “nearby” block.
• In case of no space available on a block, look at the
following block in sorted order of blocks.
• If space is available in that block ,move the highest
records of first block 1 to block 2 and slide the records
around on both blocks.
 Create an overflow block
• Records can be stored in overflow block.
• The overflow block to which a pointer is pointing can
indeed point to another overflow block.
• Deletion:
 Recover space after deletion
When using an offset table, the records can be slid around
the block so there will be an unused region in the center
that can be recovered.
In case we cannot slide records, an available space list can be
maintained in the block header.
The list head goes in the block header and available regions
hold the links in the list.

Use of tombstone

The tombstone is placed in a record in order to avoid
pointers to the deleted record to point to new records.
Until the entire database is reconstructed, the tombstone
can remain permanent.
If pointers go to fixed locations from which the location of the
record is found then we put the tombstone in that fixed
location. (See examples)
Where a tombstone is placed depends on the nature of the record
pointers.
Map table is used to translate logical record address to physical
address.
• UPDATING RECORDS
• In Fixed-Length Records, there is no effect on the
storage system
• For variable length records :
• If length increases, like insertion, we “slide the
records”
• If length decreases, like deletion, we update the
space-available list, recover the space/eliminate the
overflow blocks.
Chapter:18
• 18.1 Serial and Serializable Schedule
• A process of assuming that the transactions preserve the
consistency when executing simultaneously is called
Concurrency Control.
• This consistency is taken care by a Scheduler.
• Concurrency control in database management systems
(DBMS) ensures that database transactions are performed
concurrently without the concurrency violating the data
integrity of a database.
• Executed transactions should follow the ACID rules. The
DBMS must guarantee that only serializable (unless
Serializability is intentionally relaxed), recoverable schedules
are generated.
• It also guarantees that no effect of committed transactions is lost,
and no effect of aborted (rolled back) transactions remains in the
related database.
• ACID rules
• Atomicity - Either the effects of all or none of its operations remain
when a transaction is completed - in other words, to the outside
world the transaction appears to be indivisible, atomic.
• Consistency - Every transaction must leave the database in a
consistent state.
•
• Isolation - Transactions cannot interfere with each other. Providing
isolation is the main goal of concurrency control.
• Durability - Successful transactions must persist through crashes.


In the field of databases, a schedule is a list of actions, (i.e.
reading, writing, aborting, committing), from a set of
transactions.
In this example, Schedule D is the set of 3 transactions T1, T2,
T3. The schedule describes the actions of the transactions as
seen by the DBMS. T1 Reads and writes to object X, and then
T2 Reads and writes to object Y, and finally T3 Reads and
writes to object Z. This is an example of a serial schedule,
because the actions of the 3 transactions are not interleaved.
• Serial and Serializable Schedules:
• A schedule that is equivalent to a serial schedule has the
serializability property.
• In schedule E, the order in which the actions of the
transactions are executed is not the same as in D, but in the
end, E gives the same result as D.
T1
T2
READ (A,t)
t := t+100
WRITE (A,t)
A
B
50
50
150
READ (A,s)
s := s*2
WRITE (A,s)
READ (B,t)
t := t+100
WRITE (B,t)
300
150
READ (B,s)
s := s*2
WRITE (B,s)
300
• Non-Serializable Schedule
T1
T2
READ (A,t)
t := t+100
WRITE (A,t)
B
50
50
150
READ (A,s)
s := s*2
WRITE (A,s)
READ (B,s)
s := s*2
WRITE (B,s)
READ (B,t)
t := t+100
WRITE (B,t)
A
300
100
200
• A Serializable Schedule with details
T1
T2
READ (A,t)
t := t+100
WRITE (A,t)
B
50
50
150
READ (A,s)
s := s*1
WRITE (A,s)
READ (B,s)
s := s*1
WRITE (B,s)
READ (B,t)
t := t+100
WRITE (B,t)
A
150
50
150
18.2 Conflict Serializability
• Non-Conflicting Actions
• Two actions are non-conflicting if whenever they
occur consecutively in a schedule, swapping them
does not affect the final state produced by the
schedule. Otherwise, they are conflicting.
• Conflicting Actions: General Rules
– r1(A) w1(B)
• Two actions over the same database element conflict, if one
of them is a write
– r1(A) w2(A)
– w1(A) w2(A)
• Conflict Serializable:
We may take any schedule and make as many nonconflicting
swaps as we wish.
With the goal of turning the schedule into a serial schedule.
By doing so, the original schedule could be serialized as the effect
of the database state remains the same as we perform each
of the non-conflicting swaps.
• A schedule is said to be conflict-serializable when the
schedule is conflict-equivalent to one or more serial
schedules.
• Another definition for conflict-serializability is that a schedule
is conflict-serializable if and only if there exists an acyclic
precedence graph/serializability graph for the schedule.
• Which is conflict-equivalent to the serial schedule <T1,T2>,
but not <T2,T1>.
• Conflict equivalent / conflict-serializable
• Let Ai and Aj are consecutive non-conflicting actions that
belongs to different transactions. We can swap Ai and Aj
without changing the result.
Two schedules are conflict equivalent if they can be turned
one into the other by a sequence of non-conflicting swaps of
adjacent actions.
We shall call a schedule conflict-serializable if it is conflictequivalent to a serial schedule
Test for conflict-serializability
• Construct the precedence graph for S and observe if there are any
cycles.
– If yes, then S is not conflict-serializable
– Else, it is a conflict-serializable schedule.
• Example of a cyclic precedence graph:
– Consider the below schedule
S1: r2(A); r1(B); w2(A); r2(B); r3(A); w1(B); w3(A); w2(B);
• Observing the actions of A in the previous example (figure 2), we can
find that T2 <s1 T3.
• But when we observe B, we get both T1 <s1 T2 and T2 <s1 T1. Thus
the graph has a cycle between 1 and 2. So, based on this fact we
can conclude that S1 is not conflict-serializable
• Why the Precedence-Graph test works
• A cycle in the graph puts too many constraints on the order of
transactions in a hypothetical conflict-equivalent serial schedule.
• If there is a cycle involving n transactions T1 T2 ..Tn T1
– The actions must follow the order. i.e . The actions of T1 must
precede those of T2 which would precede those of T3 … up to n.
– But actions of Tn are also required to precede those of T1.
– So, if there is a cycle in the graph, then we can conclude that the
schedule is not conflict-serializable.
18.3 Enforcing Serializability by Locks:
• Locks
•
It works as follows :
– A request from transaction
– Scheduler checks in the lock table
– Generates a serializable schedule of actions.
• Consistency of transactions
• Actions and locks must relate each other
– Transactions can only read & write only if has a lock and
has not released the lock.
– Unlocking an element is compulsory.
• Legality of schedules
– No two transactions can aquire the lock on same element
without the prior one releasing it.
• Locking scheduler-
Grants lock requests only if it is in a legal schedule.
Also, a Lock table stores the information about
current locks on the elements.
• A legal schedule of consistent transactions but
unfortunately it is not a serializable.
• The locking scheduler delays requests that
would result in an illegal schedule.
• Two-phase locking
• Guarantees a legal schedule of consistent transactions is
conflict-serializable.
• All lock requests proceed all unlock requests.
• The growing phase:
– Obtain all the locks and no unlocks allowed.
• The shrinking phase:
– Release all the locks and no locks allowed.
• Failure of 2PL
Two Phase Locking cannot provide security against protocols.
• Shared & Exclusive Locks:
• Consistency of Transactions
– Cannot write without Exclusive Lock
– Cannot read without holding some lock
• This basically works on these principles –
1. Consistency of Transactions
– A read action can only proceed a shared or an exclusive
lock
– A write lock can only proceed a exclusive lock
– All locks need to be unlocked before commit
2. Two-phase locking of transactions
– Locking Must precede unlocking
3. Legality of Schedules
– An element may be locked exclusively by one transaction
or by several in shared mode, but not both
• Compatibility Matrices:
• Has a row and column for each lock mode.
– Rows correspond to a lock held on an element by another
transaction
– Columns correspond to mode of lock requested.
– Example :The column for S says that we can grant a shared
lock on an element if the only locks held on that element
currently are shared locks.
• Upgrading Locks
• Suppose a transaction wants to read as well as write :
– Firstly, It acquires a shared lock on the element
– Performs the calculations on the element
– And when its ready to write, It is granted a exclusive lock.
• Transactions with unpredicted read write locks can use
upgrading locks.
• Indiscriminating use of upgrading produces a deadlock.
(Limitation)
• Example : Both the transactions want to upgrade on the same
element
18.4 Locking Systems with Several Lock
Modes
• There are basically two locking schemes:
•
•
•
•
– Shared/Read Lock
( For Reading)
– Exclusive/Write Lock
( For Writing)
Compatibility Matrices
Upgrading Locks
Update Locks
Increment Locks
79
Shared & Exclusive Locks
• Consistency of Transactions
– Cannot write without Exclusive Lock
– Cannot read without holding some lock
• This basically works on 2 principles
– A read action can only proceed a shared or an exclusive
lock
– A write lock can only proceed a exclusice lock
• All locks need to be unlocked before commit
operation.
80
Shared and exclusive locks (cont.)
• Two-phase locking of transactions
– Must precede unlocking
• Legality of Schedules
– An element may be locked exclusively by one transaction or
by several in shared mode, but not both.
81
Compatibility Matrices
• Has a row and column for each lock mode.
– Rows correspond to a lock held on an element by
another transaction
– Example :
LOCK REQUESTED
S
X
LOCK
S
YES
NO
HOLD
X
NO
NO
82
Update locks
• Solves the deadlock occurring in upgrade lock
method.
• A transaction in an update lock can read but
cant write.
• Update lock can later be converted to
exclusive lock.
• Issue of an ‘update lock’ depends if the
element has shared locks.
83
Increment Locks
• Used for incrementing & decrementing stored
values.
• E.g. - Transfer money from one bank to
another, Ticket selling transactions in which
number seats are decremented after each
transaction.
84
• Increment lock
• A increment lock does not enable either read or write locks on
element.
• Any number of transaction can hold increment lock on an
element at any time.
• Shared and exclusive locks cannot be granted if an increment
lock is granted on element
85
18.5 Locking Scheduler
The order in which the individual steps of different
transactions occur is regulated by the scheduler.
The general process of assuring that transactions
preserve consistency when executing simultaneously is
called concurrency control.
• Architecture of a Locking Scheduler
The transactions themselves do not request locks, or
cannot be relied upon to do so. It is the job of the
scheduler to insert lock actions into the stream of
reads, writes and other actions that access data.
Transactions do not locks. Rather the scheduler releases the locks
when the transaction manager tells it that the transaction will
commit or abort.
Role of a Scheduler
Lock Table
• Lock Table
The lock table is a relation that associates database elements
with locking information about that element.
The table is implemented with a hash table using database
elements as a hash key.
• The size of the Lock table is proportional to the number of
locked elements only and not to the entire size of the
database . This is because any element that is not locked does
not appear in the table.
• Group Mode
The group mode is a summary of the most stringent
conditions that a transaction requesting a new lock on an
element faces. Rather than comparing the lock request with
every lock held by another transaction on the same element,
we can simplify the grant/deny decision by comparing the
request with only the group mode.
Structure of Lock Table Entries
• Handling Lock Requests
• Suppose transaction T requests a lock on A.
If there is no lock-table entry for A, then surely there are no
locks on A, so the entry is created and the request is granted.
• If there exists an entry for lock-table A, then we use it to guide
the decision about the lock request.
• Handling Unlocks
• If the value of waiting is ‘Yes’ then we need to grant one or
more locks from the list of requested locks. The different
approaches for this are:
• First-come-first-served
• Priority to shared locks
• Priority to upgrading
18.6 Managing Hierarchies of Database Elements
It Focus on two problems that come up when there id tree
structure to our data.
1. Tree Structure : Hierarchy of lockable elements. And
How to allow locks on both large elements, like
Relations and elements in it such as blocks and tuples
of relation, or individual.
2. Another is data that is itself organized in a tree. A major
example would be B-tree index.
Locks With Multiple Granularity
“Database Elements” : It is sometime noticeably the
various elements which can be used for locking.
Eg: Tuples, Pages or Blocks, Relations etc.
Example: Bank database

Small granularity locks: Larger concurrency can achieved.

Large granularity locks: Some times saves from unserializable
behavior.
Warning locks
The solution to the problem of managing locks at different
granularities involves
a new kind of lock called a “Warning.“
• It is helpful in hierarchical or nested structure .
• It involves both “ordinary” locks and “warning” locks.
•Ordinary locks: Shared(S) and Exclusive(X) locks.
•Warning locks: Intention to shared(IS) and Intention to
Exclusive(IX) locks.
• Warning Protocols
• 1. To place an ordinary S or X lock on any element. we must
•
begin at the root of the hierarchy.
• 2. If we are at the element that we want to lock, we need look
no further. We request lock there only
• 3. If the element is down in hierarchy then place warning lock
on that node respective of shared and exclusive locks and
then Move on to appropriate child and then try steps 2 or 3
and until you go to desired node and then request shared or
exclusive lock.
Compatibility Matrix
IS
IX
S
X
IS
YES
YES
YES
NO
IX
YES
YES
NO
NO
S
YES
NO
YES
NO
X
NO
NO
NO
NO
IS column: Conflicts only on X lock.
IX column: Conflicts on S and X locks.
S column: Conflicts on X and IX locks.
X column: Conflicts every locks.
Warning Protocols
Consider the relation:
M o v i e ( t i t l e , year, length, studioName)
Transaction1 (T1):
SELECT *
FROM Movie
WHERE title = 'King Kong';
Transaction2(T2):
UPDATE Movie
SET year = 1939
WHERE title = 'Gone With the Wind';
18.7 The Tree Protocol
•
•
•
•
•
ADVANTAGES OF TREE PROTOCOL
Unlocking takes less time as compared to 2PL
Freedom from deadlocks
18.7.1 MOTIVATION FOR TREE-BASED LOCKING
Consider B-Tree Index, treating individual nodes as lockable
database elements.
• Concurrent use of B-Tree is not possible with standard set of
locks and 2PL.
• Therefore, a protocol is needed which can assure
serializability by allowing access to the elements all the way at
the bottom of the tree even if the 2PL is violated
18.7.2 ACCESSING TREE STRUCTURED DATA
Assumptions:
• Only one kind of lock
• Consistent transactions
• Legal schedules
• No 2PL requirement on transaction
Rules:
• First lock can be at any node.
• Subsequent locks may be acquired only after parent node has
a lock.
• Nodes may be unlocked any time.
• No relocking of the nodes even if the node’s parent is still
locked
18.7.3 WHY TREE PROTOCOL WORKS?
• Tree protocol implies a serial order on transactions in the
schedule.
• Order of precedence:
Ti < s Tj
• If Ti locks the root before Tj, then Ti locks every node in
common with Tj before Tj.
ORDER OF PRECEDENCE
18.8 Concurrency control by Timestamps
• What is Timestamping?
• Scheduler assign each transaction T a unique number, it’s
timestamp TS(T).
• Timestamps must be issued in ascending order, at the time
when a transaction first notifies the scheduler that it is
beginning.
• Two methods of generating Timestamps.
– Use the value of system, clock as the timestamp.
– Use a logical counter that is incremented after a new
timestamp has been assigned.
• Irrespective of the method used, the Scheduler maintains a
table of currently active transactions and their timestamps.
• Timestamps for database element X and commit bit
• RT(X):- The read time of X, which is the highest timestamp of
transaction that has read X.
• WT(X):- The write time of X, which is the highest timestamp of
transaction that has write X.
• C(X):- The commit bit for X, which is true if and only if the
most recent transaction to write X has already committed.
• Physically Unrealizable Behavior
Read too late:
• A transaction U that started after transaction T, but wrote a
value for X before T reads X.
U reads X
T writes X
T start
U start
Figure: Transaction T tries to write too late
Physically Unrealizable Behavior
Write too late
• A transaction U that started after T, but read X before T got a
chance to write X.
• Dirty Read
• It is possible that after T reads the value of X written by U,
transaction U will abort.
U reads X
T writes X
T start
U start
Figure: Transaction T tries to write too late
U writes X
T reads X
U start
T start
U aborts
T could perform a dirty read if it reads X when shown
Rules for Timestamps-Based scheduling
1. Scheduler receives a request rT(X)
a) If TS(T) ≥ WT(X), the read is physically realizable.
1. If C(X) is true, grant the request, if TS(T) > RT(X), set
RT(X) := TS(T); otherwise do not change RT(X).
2. If C(X) is false, delay T until C(X) becomes true or
transaction that wrote X aborts.
b) If TS(T) < WT(X), the read is physically unrealizable.
Rollback T.
2. Scheduler receives a request WT(X).
a) if TS(T) ≥ RT(X) and TS(T) ≥ WT(X), write is physically
realizable and must be performed.
1. Write the new value for X,
2. Set WT(X) := TS(T), and
3. Set C(X) := false.
b) if TS(T) ≥ RT(X) but TS(T) < WT(X), then the write is physically
realizable, but there is already a later values in X.
a. If C(X) is true, then the previous writers of X is
committed, and ignore the write by T.
b. If C(X) is false, we must delay T.
c) if TS(T) < RT(X), then the write is physically unrealizable,
and T must be rolled back.
3. Scheduler receives a request to commit T. It must find all the
database elements X written by T and set C(X) := true. If any
transactions are waiting for X to be committed, these
transactions are allowed to proceed.
4. If the Scheduler receives a request to abort T or decides to
rollback T, then any transaction that was waiting on an
element X that T wrote must repeat its attempt to read or
write.
• Multiversion Timestamps
• Multiversion schemes keep old versions of data item to increase
concurrency.
• Each successful write results in the creation of a new version of the
data item written.
• Usage of the timestamps to label versions.
• When a read(X) operation is issued, select an appropriate version of
X based on the timestamp of the transaction, and return the value
of the selected version.
• Timestamps and Locking
• Generally, timestamping performs better than locking in situations
where:
– Most transactions are read-only.
– It is rare that concurrent transaction will try to read and write
the same element.
• In high-conflict situation, locking performs better than timestamps
Concurrency Control by
Validation(18.9)
At a Glance
Introduction
Validation based scheduling
Validation based Scheduler
Expected exceptions
Validation rules
Example
Comparisons
Summary
Introduction
What is optimistic concurrency control?
(assumes no unserializable behavior will occur)
• Timestamp- based scheduling and
• Validation-based scheduling
(allows T to access data without locks)
Validation based scheduling
Validation: checks made to ensure that the
serializability is not affected.
Scheduler keeps a record of what the active
transactions are doing.
Executes in 3 phases
1. Read- reads from RS( ), computes local address
2. Validate- compares read and write sets
3. Write- writes from WS( )
Validation based Scheduler
Contains an assumed serial order of
transactions.
Maintains three sets:
– START( ): set of T’s started but not completed
validation.
– VAL( ): set of T’s validated but not finished the
writing phase.
– FIN( ): set of T’s that have finished.
Expected exceptions
1. Suppose there is a transaction U, such that:
 U is in VAL or FIN; that is, U has validated,
 FIN(U)>START(T); that is, U did not finish before T started
 RS(T) ∩WS(T) ≠φ; let it contain database element X.
2. Suppose there is transaction U, such that:
• U is in VAL; U has successfully validated.
•FIN(U)>VAL(T); U did not finish before T entered its validation phase.
•WS(T) ∩ WS(U) ≠φ; let x be in both write sets.
Validation rules
Check that RS(T) ∩ WS(U)= φ for any
previously validated U that did not finish
before T has started i.e. FIN(U)>START(T).
Check that WS(T) ∩ WS(U)= φ for any
previously validated U that did not finish
before T is validated i.e. FIN(U)>VAL(T)
Example
Solution
 Validation of U:
Nothing to check
 Validation of T:
WS(U) ∩ RS(T)= {D} ∩{A,B}=φ
WS(U) ∩ WS(T)= {D}∩ {A,C}=φ
 Validation of V:
RS(V) ∩ WS(T)= {B}∩{A,C}=φ
WS(V) ∩ WS(T)={D,E}∩ {A,C}=φ
RS(V) ∩ WS(U)={B} ∩{D}=φ
 Validation of W:
RS(W) ∩ WS(T)= {A,D}∩{A,C}={A}
WS(W) ∩ WS(V)= {A,D}∩{D,E}={D}
WS(W) ∩ WS(V)= {A,C}∩{D,E}=φ
(W is not validated)
Comparison
Concurrency control
Mechanisms
Storage Utilization
Delays
Locks
Space in the lock table is
proportional to the number of
database elements locked.
Delays transactions but
avoids rollbacks
Timestamps
Space is needed for read and
write times with every database
element, neither or not it is
currently accessed.
Do not delay the
transactions but cause them
to rollback unless Interface
is low
Validation
Space is used for timestamps
and read or write sets for each
currently active transaction, plus
a few more transactions that
finished after some currently
active transaction began.
Do not delay the
transactions but cause them
to rollback unless interface
is low
21.1 Introduction to Information
Integration
CS257 Ashish Sharma
Need for Information Integration
• All the data in the world could put in a single
database (ideal database system)
• Databases In are created independently
hard to design a database to support future
use
• The use of databases evolves, so we can not
design a database to support every possible
future use.
University Database
• Registrar: to record student and grade
• Bursar: to record tuition payments by students
• Human Resources Department: to record
employees
• Many applications were built using these
databases . Examples include generation of
payroll checks, calculation of taxes and social
security payments to government.
Inconvenient
• change in 1 database would not reflect in the
other database which had to be performed
manually.
• Record grades for students who pay tuition
• Want to swim in SJSU aquatic center for free
in summer vacation?
(all the cases above cannot achieve the
function by a single database)
• Solution: one database
How to integrate
• Start over
build one database: contains all the legacy
databases; rewrite all the applications
result: painful
• Build a layer of abstraction (middleware)
on top of all the legacy databases
this layer is often defined by a collection of
classes
BUT…
• Due to non compatibility in structures, when
we try to connect information sources that
were developed independently, we invariably
find that sources differ in many ways. Such
sources are called Heterogeneous, and the
problem of integrating them is referred to as
the Heterogeneity Problem.
Heterogeneity Problem
• What is Heterogeneity Problem
Aardvark Automobile Co.
1000 dealers has 1000 databases
to find a model at another dealer
can we use this command:
SELECT * FROM CARS
WHERE MODEL=“A6”;
Type of Heterogeneity
•
•
•
•
•
•
Communication Heterogeneity
Query-Language Heterogeneity
Schema Heterogeneity
Data type difference
Value Heterogeneity
Semantic Heterogeneity
Communication Heterogeneity
• Now a days, it is common to allow access to
our information using HTTP protocols.
However, some dealers may not make their
databases available on net, but instead accept
the remote accesses via anonymous FTP.
Query Language Heterogeneity
• Refers to the manner in which the query or
modify a dealer’s database may vary.
• For e.g. Some of the dealers may have
different versions of database like some might
use relational database some might not have
relational database, or some of the dealers
might be using SQL, some might be using Excel
spreadsheets or some other database.
Data type Diffrences
• Serial Numbers can be represented by
character strings of varying length at one
source and fixed length at the other.
• The fixed lengths could differ, and probably at
some places, might use integers rather than
character strings.
Semantic Heterogeneity
• Terms might be given different interpretations
at different sources. One dealer might include
trucks in Cars relation, while the another puts
only automobile data in Cars relation. One
dealer might distinguish station wagons from
the minivans, while another doesn’t.
Chapter 21.2
Modes of Information Integration
Name: Ashish Sharma
Federations
 The simplest architecture for integrating several
DBs
 One to one connections between all pairs of
DBs
 n DBs talk to each other, n(n-1) wrappers are
needed
 It is good to use when the communications
between DBs are limited.
Wrapper
•
A software translates incoming queries and
outgoing answers.
–
Has the provision to allow the information sources to allow the the
sources to confirm the shared schema.
Federations Diagram
DB1
DB2
2 Wrappers
2 Wrappers
2 Wrappers
2 Wrappers
2 Wrappers
2 Wrappers
DB3
DB4
A federated collection of 4 DBs needs 12 components to translate queries from one
to another.
Example
Car dealers want to share their inventory. Each dealer queries the
other’s DB to find the needed car.
Dealer-1’s DB relation: NeededCars(model,color,autoTrans)
Dealer-2’s DB relation: Auto(Serial, model, color)
Options(serial,option)
wrapper
Dealer-1’s DB
wrapper
Dealer-2’s DB
Example…
For(each tuple(:m,:c,:a) in NeededCars){
if(:a=TRUE){/* automatic transmission wanted */
SELECT serial
FROM Autos, Options
WHERE Autos.serial = Options.serial AND Options.option = ‘autoTrans’
AND Autos.model = :m AND Autos.color =:c;
}
Else{/* automatic transmission not wanted */
SELECT serial
FROM Auto
WHERE Autos.model = :m AND
Autos.color = :c AND
NOT EXISTS( SELECT * FROM Options WHERE serial = Autos.serial
AND option=‘autoTrans’);
}
}
Dealer 1 queries Dealer 2 for needed cars
Data Warehouse


Sources are translated from their local
schema to a global schema and copied to a
central DB.
User transparent: user uses Data Warehouse
just like an ordinary DB
Warehouse Diagram
User
query
result
Warehouse
Combiner
Extractor
Extractor
Source 1
Source 2
Example
Construct a data warehouse from sources DB of 2 car dealers:
Dealer-1’s schema: Cars(serialNo, model,color,autoTrans,cdPlayer,…)
Dealer-2’s schema: Auto(serial,model,color)
Options(serial,option)
Warehouse’s schema:
AutoWhse(serialNo,model,color,autoTrans,dealer)
An extractor, is a ‘query’ to extract data from Dealer – 1’s data.
INSERT INTO AutosWhse(serialNo, model, color, autoTans, dealer)
SELECT serialNo,model,color,autoTrans,’dealer1’
From Cars;
Example
Extractor --- Query to extract data from Dealer-2’s data:
INSERT INTO AutosWhse(serialNo, model, color, autoTans, dealer)
SELECT serialNo,model,color,’yes’,’dealer2’
FROM Autos,Options
WHERE Autos.serial=Options.serial AND
option=‘autoTrans’;
INSERT INTO AutosWhse(serialNo, model, color, autoTans, dealer)
SELECT serialNo,model,color,’no’,’dealer2’
FROM Autos
WHERE NOT EXISTS ( SELECT * FROM serial =Autos.serial
AND option = ‘autoTrans’);
Construct Data Warehouse
There are mainly 3 ways to constructing
the data in the warehouse:
1)
Periodically reconstructed from the current data in the
sources, once a night or at even longer intervals.
Advantages:
simple algorithms.
Disadvantages:
1) we need to shut down the warehouse;
2) data might be outdated.
Construct Data Warehouse
2)
Updated periodically based on the changes(i.e. each
night) of the sources.
Advantages:
involve smaller amounts of data. (important when warehouse is
large and needs to be modified in a short period)
Disadvantages:
1) the process to calculate changes to the warehouse is complex.
2) data can become out of date.
Construct Data Warehouse
3)
Changed immediately, in response to each change or a
small set of changes at one or more of the sources.
Advantages:
data won’t become out of date.
Disadvantages:
requires too much communication, therefore, it is
generally too expensive.
(practical for warehouses whose underlying sources changes
slowly.)
Mediators

Virtual warehouse, which supports a virtual view or
a collection of views, that integrates several
sources.
 Mediator doesn’t store any data.
 Mediators’ tasks:
1)receive user’s query,
2)send queries to wrappers,
3) Combine the results of wrappers,
4) Send the final result to user.
A Mediator diagram
Result
User query
Mediator
Query
Result
Result
Wrapper
Query
Result
Source 1
Query
Wrapper
Query
Result
Source 2
Example
Same data sources as the example of data warehouse, the mediator
Integrates the same two dealers’ source into a view with schema:
AutoMed(serialNo,model,color,autoTrans,dealer)
When the user have a query:
SELECT sericalNo, model
FROM AkutoMed
Where color=‘red’
In this simple case, the mediator forwards the same query to each
Of the two wrappers.
Wrapper1: Cars(serialNo, model, color, autoTrans, cdPlayer, …)
SELECT serialNo,model
FROM cars
WHERE color = ‘red’;
Wrapper2: Autos(serial,model,color); Options(serial,option)
SELECT serial, model
FROM Autos
WHERE color=‘red’;
Example
There may be different options for the mediator to forward user query,
for example, the user queries if there are a specific model&color car
(i.e. “Gobi”, “blue”).
The mediator decides 2nd query is needed or not based on the result of
1st query. That is, If dealer-1 has the specific car, the mediator doesn’t
have to query dealer-2.
Chapter 21 Information Integration
21.3 Wrappers in Mediator-Based Systems
Wrappers in Mediator-based Systems
 More complicated than that in most data warehouse
system.
 Able to accept a variety of queries from the mediator
and translate them to the terms of the source.
 Pass the result to the mediator.
How to design a wrapper?
Classify the possible queries that the mediator can
ask into templates, which are queries with
parameters that represent constants.
Templates for Query Patterns:
 Use notation T=>S to express the idea that the
template T is turned by the wrapper into the source
query S.
• Example 1
Dealer 1
Cars (serialNo, model, color, autoTrans,
navi,…)
For use by a mediator with schema
AutoMed (serialNo, model, color, autoTrans,
dealer)
• If the code representing the color is denoted by $c,,
template will be:
then the
SELECT *
FROM AutosMed
WHERE color = ’$c’;
=>
SELECT serialNo, model, color, autoTrans, ’dealer1’
FROM Cars
WHERE color=’$c’;
(Template T => Source query S)
There will be a total of 2n templates if we have the option of
specifying n attributes.
Wrapper Generators
• The wrapper generator creates a table holds
the various query patterns contained in the
templates.
• The source queries that are associated with
each.
A driver is used in each wrapper, the task of
the driver is to:
 Accept a query from the mediator.
 Search the table for a template that matches the
query.
 The source query is sent to the source, again using a
“plug-in” communication mechanism.
 The response is processed by the wrapper.
• Example 2
• If wrapper is designed with more complicated
template with queries specify both model and
color.
SELECT *
FROM AutosMed
WHERE model = ’$m’ AND color = ’$c’;
=>
SELECT serialNo, model, color, autoTrans, ’dealer1’
FROM Cars
WHERE model = ’$m’ AND color=’$c’;
• Now we suppose the only template we have is color. However the wrapper
is asked by the Mediator to find “blue Gobi model car.”
Solution:
1. Use template with $c=‘blue’ find all blue cars
and store them in a temporary relation:
TemAutos (serialNo, model, color, autoTrans,
dealer)
2.The wrapper then return to the mediator the
desired set of automobiles by excuting the local
query:
SELECT*
FROM TemAutos
WHERE model= ’Gobi’;
INFORMATION
INTEGRATION
Sections 21.4 – 21.5
21.4 Capability Based Optimization
• Introduction
– Typical DBMS estimates the cost of each query
plan and picks what it believes to be the best
– Mediator – has knowledge of how long its sources
will take to answer
– Cost measure is not the only factor on which the ‘Optimization of
mediator queries’ can rely upon.
21.4.1 The Problem of Limited Source
Capabilities
• Many sources have only Web Based interfaces
• Web sources usually allow querying through a
query form
• E.g. Amazon.com interface allows us to query about books in many
different ways.
• But we cannot ask questions that are too
general
– E.g. Select * from books;
21.4.1 The Problem of Limited Source
Capabilities (con’t)
• Reasons why a source may limit the ways in
which queries can be asked
– Earliest database did not use relational DBMS that
supports SQL queries
– Indexes on large database may make certain queries
feasible, while others are too expensive to execute
– Security reasons
• E.g. Medical database may answer queries about averages,
but won’t disclose details of a particular patient's
information
21.4.2 A Notation for Describing
Source Capabilities
Adornments – Sequences of codes that
represent the requirements for the attributes
of the relation, in their standard order
f(free) – attribute can be specified or not
b(bound) – must specify a value for an attribute
but any value is allowed
u(unspecified) – not permitted to specify a value
for a attribute
21.4.2 A notation for Describing
Source Capabilities….(cont’d)

c[S](choice from set S) means that a value must be
specified and value must be from finite set S.
o[S](optional from set S) means either do not
specify a value or we specify a value from finite set
S
A prime (f’) specifies that an attribute is not a part
of the output of the query
A capabilities specification is a set of adornments

A query must match one of the adornments in its capabilities specification
21.4.2 A notation for Describing
Source Capabilities….(cont’d)
E.g. Dealer 1 is a source of data in the form:
Cars (serialNo, model, color, autoTrans, navi)
Adornment for this query form is b’uuuu
21.4.3 Capability-Based Query-Plan
Selection
• Given a query at the mediator, a capability based
query optimizer first considers what queries it
can ask at the sources to help answer the query
• The process is repeated until:
– Enough queries are asked at the sources to resolve all
the conditions of the mediator query and therefore
query is answered. Such a plan is called feasible.
– No more valid forms of source queries can be constructed, yet still
cannot answer the mediator query. It has been an impossible query.
21.4.3 Capability-Based Query-Plan
Selection (cont’d)
• The simplest form of mediator query where we
need to apply the above strategy is join relations
• E.g we have sources for dealer 2
– Autos(serial, model, color)
– Options(serial, option)
• Suppose that ubf is the sole adornment for Auto and
Options have two adornments, bu and uc[autoTrans, navi]
• The query her is to find the serial numbers and colors of
Gobi models with a navigation system
21.4.4 Adding Cost-Based
Optimization
• Mediator’s Query optimizer is not done when
the capabilities of the sources are examined
• Having found feasible plans, it must choose
among them
• Making an intelligent, cost based query
optimization requires that the mediator knows
a great deal about the costs of queries
involved
• Since the Sources are independent of the mediator, it is difficult to
estimate the cost
21.5 Optimizing Mediator Queries
• Chain algorithm –
– A greedy algorithm which answers the query by
sending a sequence of requests to its sources. Will
always find a solution assuming at least one
solution exists.
– The solution may not be optimal.
21.5.2 Obtaining Answers for
Subgoals
• Rules for subgoals and sources:
– Suppose we have the following subgoal:
Rx1x2…xn(a1, a2, …, an),
and source adornments for R are: y1y2…yn.
• If yi is b or c[S], then xi = b.
• If xi = f, then yi is not output restricted.
– The adornment on the subgoal matches the
adornment at the source:
• If yi is f, u, or o[S] and xi is either b or f.
21.5.3 The Chain Algorithm
• Maintains 2 types of information:
– An adornment for each subgoal.
– A relation X that is the join of the relations for all
the subgoals that have been resolved.
• If the mediator query provides a constant binding for the corresponding
argument of that subgoal, then he adornment for a subgoal is b .
21.5.3 The Chain Algorithm (con’t)
First, initialize adornments of subgoals and X.
Then, repeatedly select a subgoal that can be
resolved. Let Rα(a1, a2, …, an) be the subgoal:
1.
Wherever α has a b, we shall find the argument in R is a constant, or a
variable in the schema of R.
 Project X onto its variables that appear in R.
21.5.3 The Chain Algorithm (con’t)
2. For each tuple t in the project of X, issue a
query to the source as follows (β is a source
adornment).
–
If β has b, then the corresponding component of α has b, and we can
use the corresponding component of t for source query.
– If β has c[S], and the corresponding component
of t is in S, then the corresponding component of
α has b, and we can use the corresponding
component of t for the source query.
–
If β has f, and the corresponding component of α is b, provide a
constant value for source query.
21.5.3 The Chain Algorithm (con’t)
– If a component of β is u, then provide no binding
for this component in the source query.
– If a component of β is o[S], and the
corresponding component of α is f, then treat it
as if it was a f.
–
If a component of β is o[S], and the corresponding component of α is
b, then treat it as if it was c[S].
3. Every variable among a1, a2, …, an is now
bound. For each remaining unresolved subgoal, change its
adornment so any position holding one of these variables is b.
21.5.3 The Chain Algorithm (con’t)
4. Replace X with X πs(R), where S is all of the
variables among: a1, a2, …, an.
5. Project out of X all components that
α
correspond to variables
that do not appear
in the head or in any unresolved subgoal.
•
For the algorithm to have an answer, every subgoal should be be
resolved and then X would be the answer.
21.5.3 The Chain Algorithm Example
• Mediator query:
– Q: Answer(c) ← Rbf(1,a) AND Sff(a,b) AND Tff(b,c)
• Example:
Relation
R
w
Data
1
Adornment
S
T
x
x
y
y
z
2
2
4
4
6
1
3
3
5
5
7
1
4
5
8
bf
c’[2,3,5]f
bu
21.5.3 The Chain Algorithm Example
• Initially, the adornments on the subgoals are
the same as Q, and X contains an empty tuple.
– Since they have ff adornments, S and T cannot be resolved, but the
sources have either a, b or c.
• Send R(w,x) with w=1 to get the tables on the
previous page.
21.5.3 The Chain Algorithm Example
(con’t)
• Since only the second component of R(1,a) is a variable, project the
subgoal’s relation onto its second component,
a
2
3
4
• This is joined with X, resulting in X equaling
this relation.
• Change adornment on S from ff to bf.
21.5.3 The Chain Algorithm Example
(con’t)
• Now we resolve Sbf(a,b):
– Project X onto a, resulting in X.
– Now, search S for tuples with attribute a equivalent to attribute a in X.
a
b
2
4
3
5
• Join this relation with X, and remove a as it doesn’t appear in the head
nor any unresolved subgoal:
b
4
5
21.5.3 The Chain Algorithm Example
(con’t)
• Now we resolve Tbf(b,c):
b
c
4
6
5
7
5
8
• Join this relation with X and project onto the c
attribute to get the relation for the head.
• Solution is {(6), (7), (8)}.
21.5.4 Incorporating Union Views at
the Mediator
• This implementation of the Chain Algorithm
does not consider that several sources can
contribute tuples to a relation.
• If specific sources have tuples to contribute that other sources may not
have, it adds complexity.
• To resolve this, we can consult all sources, or
make best efforts to return all the answers.
21.5.4 Incorporating Union Views at
the Mediator (con’t)
• Consulting All Sources
• Less practical because it makes queries harder
to answer and impossible if any source is
down.
• Best Efforts
– We need only 1 source with a matching
adornment to resolve a subgoal.
– Need to modify chain algorithm to revisit each subgoal when that
subgoal has new bound requirements.
Local-as-View Mediators
Priya Gangaraju(Class Id:203)
Local-as-View Mediators.
• In a LAV mediator, global predicates defined are
not views of the source data.
• Expressions are defined for each source with
global predicates that describe tuples that
source produces
• Mediator answers the queries by constructing
the views as provided by the source.
Motivation for LAV Mediators
• The Relationship provided by the mediator is
more subtle.
• For example, consider the predicate Par(c, p)
meaning that p is a parent of c which
represents the set of all child parent facts that
could ever exist.
• The sources will provide information about
whatever child-parent facts they know.
Motivation(contd..)
• There can be sources which may provide childgrandparent facts but not child- parent facts
at all.
• This source can never be used to answer the
child-parent query under GAV mediators.
• LAV mediators allow to say that a certain
source provides grand parent facts.
• This can be used to discover how and when to
use the source in a given query.
Example..
• Consider global predicate Par(c, p) meaning
that p is a parent of c.
• One source produces parent facts. Its view is
defined by the conjunctive queryV1(c, p)  Par(c, p)
• Another source produces some grand parents
facts. Then its conjunctive query will be –
V2(c, g)  Par(c, p) AND Par(p, g)
Example contd..
• The query at mediator will ask for great-grand
parent facts to be obtained from sources:
Q(w, z)  Par(w, x) AND Par(x, y) AND Par(y, z)
• One solution can be using the parent
predicate(V1) directly three times.
Q(w, z)  V1(w, x) AND V1 (x, y) AND V1(y, z)
• Another solution can be to use V1(parent facts)
and V2(grandparent facts).
Q(w, z)  V1(w, x) AND V2(x, z)
Or Q(w, z)  V2(w, y) AND V1(y, z)
Expanding Solutions.
• Consider a query Q, a solution S that has a
body whose subgoals are views and each view
V is defined by a conjunctive query with that
view as the head.
• The body of V’s conjunctive query can be
substituted for a subgoal in S that uses the
predicate V to have a body consisting of only
global predicates.
Expansion Algorithm
• A solution S has a subgoal V(a1, a2,…an) where
ai’s can be any variables or constants.
• The view V can be of the form
V(b1, b2,….bn)  B
Where B represents the entire body.
• V(a1, a2, … an) can be replaced in solution S by
a version of body B that has all the subgoals of
B with variables possibly altered.
Expansion Algorithm contd..
 The rules for altering the variables of B are:
1. First identify the local variables B, variables
that appear in the body but not in the head.
2. If there are any local variables of B that
appear in B or in S, replace each one by a
distinct new variable that appears nowhere
in the rule for V or in S.
3. In the body B, replace each bi by ai for i =
1,2…n.
Example.
• Consider the view definitions,
V1(c, p)  Par(c, p)
V2(c, g)  Par(c, p) AND Par(p, g)
• One of the proposed solutions S is
Q(w, z)  V1(w, x) AND V2(x, z)
• The first subgoal with predicate V1 in the solution can be expanded as
Par(w, x) as there are no local variables.
Example Contd.
• The V2 subgoal has a local variable p which
doesn’t appear in S nor it has been used as a
local variable in another substitution. So p can
be left as it is.
• Only x and z are to be substituted for variables c and g.
• The Solution S now will be
Q(w, z)  Par(w, x) AND Par(x, p) AND Par(p,z)
Containment of Conjunctive Queries
 A containment mapping from Q to E is a
function т from the variables of Q to the
variables and constants of E, such that:
1.
If x is the ith argument of the head of Q, then т(x) is the ith argument
of the head of E.
2. Add to т the rule that т(c)=c for any constant
c. If P(x1,x2,… xn) is a subgoal of Q, then
P(т(x1), т(x2),… т(xn)) is a subgoal of E.
Example.
• Consider two Conjunctive queries:
Q1: H(x, y)  A(x, z) and B(z, y)
Q2: H(a, b)  A(a, c) AND B(d, b) AND A(a, d)
• When we apply the substitution,
Т(x) = a, Т(y) = b, Т(z) = d, the head of Q1
becomes H(a, b) which is the head of Q2.
So,there is a containment mapping from Q1 to
Q2.
Example contd..
• The first subgoal of Q1 becomes A(a, d) which
is the third subgoal of Q2.
• The second subgoal of Q1 becomes the second
subgoal of Q2.
• There is also a containment mapping from Q2
to Q1 so the two conjunctive queries are
equivalent.
Why the Containment-Mapping Test
Works
• Suppose there is a containment mapping т from
Q1 to Q2.
• When Q2 is applied to the database, we look for
substitutions σ for all the variables of Q2.
• The substitution for the head becomes a tuple t
that is returned by Q2.
• If we compose т and then σ, we have a mapping
from the variables of Q1 to tuples of the database
that produces the same tuple t for the head of
Q1.
Finding Solutions to a Mediator Query
There can be infinite number of solutions built
from the views using any number of subgoals and
variables.
LMSS Theorem can limit the search which states
that
• If a query Q has n subgoals, then any answer produced
by any solution is also produced by a solution that has
at most n subgoals.
 If the conjunctive query that defines a view V has in its body a predicate P
that doesn’t appear in the body of the mediator query, then we need not
consider any solution that uses V.
Example.
• Recall the query
Q1: Q(w, z) Par(w, x) AND Par(x, y) AND
Par(y, z)
• This query has three subgoals, so we don’t
have to look at solutions with more than three
subgoals.
Why the LMSS Theorem Holds
• Suppose we have a query Q with n subgoals
and there is a solution S with more than n
subgoals.
• The expansion E of S must be contained in
Query Q, which means that there is a
containment mapping from Q to E.
• We remove from S all subgoals whose expansion was not the target of one
of Q’s subgoals under the containment mapping.
Contd..
• We would have a new conjunctive query S’
with at most n subgoals.
• If E’ is the expansion of S’ then, E’ is a subset
of Q.
• S is a subset of S’ as there is an identity
mapping.
• Thus S need not be among the solutions to query Q.
Query Compiler
By:Payal Gupta
Roll No:106(225)
Professor :Tsau Young Lin
Pushing Selections
• It is, replacing the left side of one of the rules
by its right side.
• In pushing selections we first a selection as
far up the tree as it would go, and then push
the selections down all possible branches.
• Let’s take an example:
• S t a r s I n ( t i t l e , year, starName)
• Movie(title, year, length, incolor, studioName,
producerC#)
• Define view MoviesOf 1996 by:
CREATE VIEW MoviesOfl996 AS
SELECT *
FROM Movie
,WHERE year = 1996;
• "which stars worked for which studios in
1996?“ can be given by a SQL Query:
SELECT starName, studioName
FROM MoviesOfl996 NATURAL JOIN StarsIn;
ΠstarName,studioName
O Year=1996
StarsIn
Movie
Logical query plan constructed from definition of a query and view
Improving the query plan by moving selections up and down the tree
ΠstarName,studioName
O Year=1996
Movie
O Year=1996
StarsIn
Laws Involving Projection
• "pushing" projections really involves introducing a
new projection somewhere below an existing
projection.
• projection keeps the number of tuples the same and
only reduces the length of tuples.
•
To describe the transformations of extended
projection Consider a term E + x on the list for a
projection, where E is an attribute or an expression
involving attributes and constants and x is an output
attribute.
Example
• Let R(a, b, c) and S(c, d, e) be two relations. Consider the
expression x,+,,,, b+y(R w S). The input attributes of the
projection are a,b, and e, and c is the only join attribute.
We may apply the law for pushing projections below
joins to get the equivalent expression:
Πa+e->x,b->y(Πa,b,c(R)
•
Eliminating this projection and getting a third
equivalent expression:Πa+e->x, b->y( R
Πc,e(S))
Πc,e(S))
• In addition, we can perform a projection
entirely before a bag union. That is:
ΠL(R UB S)= ΠL(R) )UB ΠL(S)
Laws About Joins and Products
• laws that follow directly from the definition of the join:
R c S=
Oc(
R * S)
• R
S = ΠL( O c ( R * S) ) , where C is the condition that
equates each pair of attributes from R and S with the same
name. and L is a list that includes one attribute from each
equated pair and all the other attributes of R and S.
• We identify a product followed by a selection as a join of
some kind.
Laws Involving Duplicate Elimination
• The operator δ which eliminates duplicates
from a bag can be pushed through many but
not all operators.
• In general, moving a δ down the tree reduces the size of
intermediate relations and may therefore beneficial.
• Moreover, sometimes we can move δ to a
position where it can be eliminated
altogether,because it is applied to a relation
that is known not to possess duplicates.
• δ (R)=R if R has no duplicates. Important cases
of such a relation R include:
a) A stored relation with a declared primary
key, and
b) A relation that is the result of a γ operation,
since grouping creates a relation with no
duplicates.
• Several laws that "push" δ through other
operators are:
• δ (R*S) =δ(R) * δ(S)
• δ (R
S)=δ(R)
δ(S)
• δ (R c S)=δ(R) c δ(S)
• δ (O c (R))=O c (δ(R))
• We can also move the δ to either or both of the
arguments of an intersection:
• δ (R ∩B S) = δ(R) ∩B S = R ∩B δ (S) = δ(R) ∩B δ (S)
Laws Involving Grouping and Aggregation
• When we consider the operator γ, we find that
the applicability of many transformations
depends on the details of the aggregate
operators used. Thus we cannot state laws in
the generality that we used for the other
operators. One exception is that a γ absorbs a δ
. Precisely:
• δ(γL(R))=γL(R)
• let us call an operator γ duplicate-impervious
if the only aggregations in L are MIN and/or
MAX then:
• γ L(R) = γ L (δ(R)) provided γL is duplicateimpervious.
Example
• Suppose we have the relations
MovieStar(name , addr , gender, birthdate)
StarsIn(movieTitle, movieyear, starname)
and we want to know for each year the birthdate
of the youngest star to appear in a movie that
year. We can express this query as:
SELECT movieyear, MAX(birth date)
FROM MovieStar, StarsIn
WHERE name = starName
GROUP BY movieyear;
γ movieYear, MAX ( birthdate )
name = starName
O
MovieStar StarsIn
Initial logical query plan for the query
• Some transformations that we can apply to Fig are
1. Combine the selection and product into an equijoin.
2.Generate a δ below the γ , since the γ is duplicateimpervious.
3. Generate a Π between the γ and the introduced δ to
project onto movie-Year and birthdate, the only
attributes relevant to the γ
γ movieYear, MAX ( birthdate )
Π movieYear, birthdate
δ
name = starName
MovieStar StarsIn
Another query plan for the query
γ movieYear, MAX ( birthdate )
Π movieYear, birthdate
name = starName
δ
δ
Π birthdate,name Π movieYear,starname
MovieStar
StarsIn
third query plan for Example
The Query Compiler
Section 16.3
DATABASE SYSTEMS – The Complete Book
Review
Query
Parser
Section 16.1
Preprocessor
Logical query
plan generator
Section 16.3
Query Rewriter
Preferred logical query plan
Two steps to turn Parse tree into Preferred
Logical Query Plan
• Replace the nodes and structures of the parse tree, in
appropriate groups, by an operator or operators of relational
algebra.
• Take the relational algebra expression and turn it into an
expression that we expect can be converted to the most
efficient physical query plan.
Reference Relations
• StarsIn (movieTitle, movieYear, starName)
• MovieStar (name, address, gender, birthdate)
Conversion to Relational Algebra
• If we have a <Query> with a <Condition> that has no
subqueries, then we may replace the entire construct – the
select-list, from-list, and condition – by a relational-algebra
expression.
• The relational-algebra expression consists of
the following from bottom to top:
– The products of all the relations mentioned in the
<FromList>, which Is the argument of:
– A selection σC, where C is the <Condition> expression in
the construct being replaced, which in turn is the argument
of: A projection πL , where L is the list of attributes in the
<SelList>
Example:
•
SELECT movieTitle
FROM Starsin, MovieStar
WHERE starName = name AND
birthdate LIKE ‘%1960’;
SELECT movieTitle
FROM Starsin, MovieStar
WHERE starName = name AND
birthdate LIKE ‘%1960’;
Translation to an algebraic expression tree
Removing Subqueries From Conditions
• For parse trees with a <Condition> that has a
subquery
• Intermediate operator – two argument selection
• It is intermediate in between the syntactic
categories of the parse tree and the relationalalgebra operators that apply to relations.
Using a two-argument σ
πmovieTitle
σ
<Condition>
StarsIn
<Tuple>
<Attribute>
starName
IN
πname
σ birthdate LIKE ‘%1960'
MovieStar
Two argument selection with condition involving
IN
•
Now say we have, two arguments – some relation and the
second argument is a <Condition> of the form t IN S.
•
•
•
‘t’ – tuple composed of some attributes of R
‘S’ – uncorrelated subquery
Steps to be followed:
1.
2.
3.
Replace the <Condition> by the tree that is the expression for S ( δ is
used to remove duplicates)
Replace the two-argument selection by a one-argument selection σC.
Give σC an argument that is the product of R and S.
Two argument selection with condition involving
IN
σ
R
σC
<Condition>
t
IN
X
S
R
δ
S
The effect
Improving the Logical Query Plan
• Algebraic laws to improve logical query plans:
– Selections can be pushed down the expression tree
as far as they can go.
– Similarly, projections can be pushed down the tree,
or new projections can be added.
– Duplicate eliminations can sometimes be removed,
or moved to a more convenient position in the tree.
– Certain selections can be combined with a product below to turn the
pair of operations into an equijoin.
Grouping Associative/ Commutative Operators
• An operator that is associative and commutative
operators may be though of as having any number of
operands.
• We need to reorder these operands so that the multiway
join is executed as sequence of binary joins.
• Its more time consuming to execute them in the order suggested by parse
tree.
• For each portion of subtree that consists of nodes with the
same associative and commutative operator (natural
join, union, and intersection), we group the nodes with
these operators into a single node with many children.
The effect of query rewriting
Π movieTitle
Starname = name
StarsIn
σbirthdate LIKE ‘%1960’
MovieStar
Final step in producing logical query plan
=>
R
U
U
U
R
S
T
V
W
U
U
S
T
V
W
An Example to summarize
• “find movies where the average age of the stars was at most
40 when the movie was made”
• SELECT distinct m1.movieTitle, m1,movieYear
FROM StarsIn m1
WHERE m1.movieYear – 40 <= (
SELECT AVG (birthdate)
FROM StartsIn m2, MovieStar s
WHERE m2.starName = s.name AND
m1.movieTitle = m2.movieTitle AND
m1.movieYear = m2.movieyear
);
SELECT distinct m1.movieTitle, m1,movieYear
FROM StarsIn m1
WHERE m1.movieYear – 40 <= (
SELECT AVG (birthdate)
FROM StartsIn m2, MovieStar s
WHERE m2.starName = s.name AND
m1.movieTitle = m2.movieTitle AND
m1.movieYear = m2.movieyear );
Selections combined with a product to turn the
pair of operations into an equijoin…
Condition pushed up the expression tree…
`
Selections combined…
16.4 Estimating the Cost of Operations
Project Guide
Dr. T. Y. Lin
Computer Science Dept
San Jose State University
Prepared By
Akshay Shenoy
Estimating the Cost of Operations
• After we get to the logical query plan, we turn
it into physical plan.
• Cost based enumeration corresponds to the
consideration of all the possible physical plan
and estimation of their costs. The one with
least estimated cost is the one selected to be
passed to the query-execution engine.
Physical Plan
For each physical plan select
• An order and grouping for associative-andcommutative operations like joins, unions.
• An Algorithm for each operator in the logical
plan.
eg: whether nested loop join or hash join to be
used
• Additional operators that are needed for the
physical plan but that were not present
explicitly in the logical plan. eg: scanning,
sorting
Estimating Sizes of Intermediate Relations
Rules for estimating the number of tuples in an
intermediate relation:
1. Give accurate estimates
2. Are easy to compute
3. Are logically consistent
•
Objective of estimation is to select best physical plan with least cost.
Estimating the Size of a Projection
We should treat a classical, duplicate-eliminating
projection as a bag-projection.
The projection is different from the other operators,
in that the size of the result is computable. Since a
projection produces a result tuple for every
argument tuple, the only change in the output size is
the change in the lengths of the tuples.
Estimating the Size of a Selection(1)
• While performing selection, we may reduce the
number of tuples but the sizes of tuple remain
same.
• Let
, ,where A is an attribute of
S


A  c(R)
R and C is a constant. Then we recommend as
an estimate:
T(S) =T(R)/V(R,A)
• The rule above surely holds if all values of
attribute A occur equally often in the database.
Estimating the Size of a Selection(2)
• If S  a
(R) ,then our estimate for
T(s) is: T(S) = T(R)/3
 c
• We may use T(S)=T(R)(V(R,a) -1 )/ V(R,a) as an estimate.
• When the selection condition C is the And of
several equalities and inequalities, we can treat
the selection
as a cascade of simple
selections, each
c(Rof) which checks for one of the
conditions.
Estimating the Size of a Selection(3)
• A less simple, but possibly more accurate estimate of the
size of S  c1 OR c2(R)is to assume that C1 and 2 of
which satisfy C2, we would estimate the number of tuples
in S as
m
n(1  (1  m1 / n)(1  m2 / n))
In explanation, 1  m1 / n is the fraction of tuples that do
not satisfy C1, and1  m2 / n is the fraction that do not
satisfy C2. The product of these numbers is the fraction of
R’s tuples that are not in S, and 1 minus this product is the
fraction that are in S.
Estimating the Size of a Join
• two simplifying assumptions:
1. Containment of Value Sets
If R and S are two relations with attribute Y and V(R,Y)<=V(S,Y) then every
Y-value of R will be a Y-value of S.
2. Preservation of Value Sets
Join a relation R with another relation S with attribute A in R and not in S
then all distinct values of A are preserved and not lost.V(S  R,A) = V(R,A)
Under these assumptions, we estimate
T(R S) = T(R)T(S)/max(V(R,Y), V(S, Y))
Natural Joins With Multiple Join Attributes
Of the T(R),T(S) pairs of tuples from R and S, the expected
number of pairs that match in both y1 and y2 is:
T(R)T(S)/max(V(R,y1), V(S,y1)) max(V(R, y2), V(S, y2))
In general, the following rule can be used to estimate the size of a
natural join when there are any number of attributes shared between
the two relations.
The estimate of the size of R S is computed by
multiplying T(R) by T(S) and dividing by the largest of V(R,y)
and V(S,y) for each attribute y that is common to R and S.
Joins of Many Relations(1)
• rule for estimating the size of any join
Start with the product of the number of tuples in
each relation. Then, for each attribute A appearing
at least twice, divide by all but the least of V(R,A)’s.
We can estimate the number of values that will remain for attribute A
after the join. By the preservation-of-value-sets assumption, it is the
least of these V(R,A)’s.
Joins of Many Relations(2)
Based on the two assumptions-containment and
preservation of value sets:
• No matter how we group and order the terms in a
natural join of n relations, the estimation of rules,
applied to each join individually, yield the same
estimate for the size of the result.
Estimating Sizes for Other Operations
• Union: the average of the sum and the
larger.
• Intersection:
• approach1: take the average of the
extremes, which is the half the smaller.
• approach2: intersection is an extreme
case of the natural join, use the formula
• T(R S) = T(R)T(S)/max(V(R,Y), V(S, Y))

Estimating Sizes for Other Operations
• Difference: T(R)-(1/2)*T(S)
• Duplicate Elimination: take the smaller of
(1/2)*T(R) and the product of all the V(R, ai )’s.
• Grouping and Aggregation: upper-bound the
number of groups by a product of V(R,A)’s, here
attribute A ranges over only the grouping
attributes of L. An estimate is the smaller of
(1/2)*T(R) and this product.
16.5 Introduction to Cost-based
plan selection
• Whether selecting a logical query plan or constructing a
physical query plan from a logical plan, the query optimizer
needs to estimate the cost of evaluating certain expressions.
• We shall assume that the "cost" of evaluating an expression is
approximated well by the number of disk I/O's performed.
The number of disk I/O’s, in turn, is influenced by:
1. The particular logical operators chosen to implement the
query, a matter decided when we choose the logical query
plan.
2. The sizes of intermediate results (whose estimation we
discussed in Section 16.4)
3. The physical operators used to implement logical operators.
e.g.. The choice of a one-pass or two-pass join, or the choice
to sort or not sort a given relation.
4. The ordering of similar operations, especially joins
5. The method of passing arguments from one physical operator to the next.
Obtaining Estimates for Size Parameter
• The formulas of Section 16.4 were predicated on knowing
certain important parameters, especially T(R), the number of
tuples in a relation R, and V(R, a), the number of different
values in the column of relation R for attribute a.
• A modern DBMS generally allows the user or administrator
explicitly to request the gathering of statistics, such as T(R)
and V(R, a). These statistics are then used in subsequent
query optimizations to estimate the cost of operations.
• By scanning an entire relation R, it is straightforward to count the number
of tuples T(R) and also to discover the number of different values V(R, a)
for each attribute a.
• The number of blocks in which R can fit, B(R), can be
estimated either by counting the actual number of blocks
used (if R is clustered), or by dividing T(R) by the number of
tuples per block
Computation of Statistics
• Periodic re-computation of statistics is the norm in most
DBMS's, for several reasons.
– First, statistics tend not to change radically in a short time.
– Second, even somewhat inaccurate statistics are useful as long as they
are applied consistently to all the plans.
– Third, the alternative of keeping statistics up-to-date can make the
statistics themselves into a "hot-spot" in the database; because
statistics are read frequently, we prefer not to update them frequently
too.
• The recomputation of statistics might be triggered
automatically after some period of time, or after some
number of updates.
• However, a database administrator noticing, that poorperforming query plans are being selected by the query
optimizer on a regular basis, might request the recomputation
of statistics in an attempt to rectify the problem.
• Computing statistics for an entire relation R can be very
expensive, particularly if we compute V(R, a) for each
attribute a in the relation.
• One recommended approach is to compute approximate statistics by
sampling only a fraction of the data. For example, let us suppose we want
to sample a small fraction of the tuples to obtain an estimate for V(R, a).
Heuristics for Reducing the Cost of Logical Query
Plans
• One important use of cost estimates for queries or subqueries is in the application of heuristic transformations of the
query.
• We have already observed previously how certain heuristics
applied independent of cost estimates can be expected
almost certainly to improve the cost of a logical query plan.
• However, there are other points in the query optimization
process where estimating the cost both before and after a
transformation will allow us to apply a transformation where
it appears to reduce cost and avoid the transformation
otherwise.
• In particular, when the preferred logical query plan is being generated, we
may consider a number of optional transformations and the costs before
and after.
• Because we are estimating the cost of a logical query plan, so
we have not yet made decisions about the physical operators
that will be used to implement the operators of relational
algebra, our cost estimate cannot be based on disk I/Os.
• An alternative approach is to estimate the sizes of all intermediate results
using the techniques of Section 16.1, and their sum is our heuristic
estimate for the cost of the entire logical plan.
• For example,
• Consider the initial logical query plan of as shown below,
δ
σa = 10
R
S
• The statistics for the relations R and S be as follows
R(a, b)
T(R) = 5000
V(R, a) = 50
V(R, b) = 100
S(b, c)
T(S) = 2000
V(S, a) = 200
V(S, b) = 100
• To generate a final logical query plan from, we shall insist that the selection be
pushed down as far as possible. However, we are not sure whether it makes
sense to push the δ below the join or not. Thus, we generate from the two
query plans shown in next slide. They differ in whether we have chosen to
eliminate duplicates before or after the join.
250
50
δ
500
δ
δ
100 σa = 10
S
2000
5000 R
1000
1000
0
100 σa = 10
S
2000
5000 R
(a)
(b)
• We know how to estimate the size of the result of the
selections, we divide T(R) by V(R, a) = 50.
• We also know how to estimate the size of the joins; we
multiply the sizes of the arguments and divide by max(V(R, b),
V(S, b)), which is 200.
Approaches to Enumerating Physical Plans
• Let us consider the use of cost estimates in the conversion of
a logical query plan to a physical query plan.
• The baseline approach, called exhaustive, is to consider all
combinations of choices (for each of issues like order of joins,
physical implementation of operators, and so on).
• Each possible physical plan is assigned an estimated cost, and the one with
the smallest cost is selected.
• There are two broad approaches to exploring the space of
possible physical plans:
– Top-down: Here, we work down the tree of the logical query plan from
the root.
– Bottom-up: For each sub-expression of the logical-query-plan tree, we
compute the costs of all possible ways to compute that subexpression. The possibilities and costs for a sub-expression E are
computed by considering the options for the sub-expressions for E,
and combining them in all possible ways with implementations for the
root operator of E.
Branch-and-Bound Plan Enumeration
• This approach, often used in practice, begins by using
heuristics to find a good physical plan for the entire logical
query plan. Let the cost of this plan be C. Then as we consider
other plans for sub-queries, we can eliminate any plan for a
sub-query that has a cost greater than C, since that plan for
the sub-query could not possibly participate in a plan for the
complete query that is better than what we already know.
• Likewise, if we construct a plan for the complete query that
has cost less than C, we replace C by the cost of this better
plan in subsequent exploration of the space of physical query
plans.
Hill Climbing
• This approach, in which we really search for a “valley” in the
space of physical plans and their costs; starts with a
heuristically selected physical plan.
• We can then make small changes to the plan, e.g., replacing
one method for an operator by another, or reordering joins by
using the associative and/or commutative laws, to find
"nearby" plans that have lower cost.
• When we find a plan such that no small modification yields a plan of lower
cost, we make that plan our chosen physical query plan.
Dynamic Programming
• In this variation of the general bottom-UP strategy, we keep
for each sub-expression only the plan of least cost.
• As we work UP the tree, we consider possible
implementations of each node, assuming the best plan for
each sub-expression is also used.
Selinger-Style Optimization
• This approach improves upon the dynamic-programming
approach by keeping for each sub-expression not only the
plan of least cost, but certain other plans that have higher
cost, yet produce a result that is sorted in an order that may
be useful higher up in the expression tree. Examples of such
interesting orders are when the result of the sub-expression is
sorted on one of:
– The attribute(s) specified in a sort (r) operator at the root
– The grouping attribute(s) of a later group-by (γ) operator.
– The join attribute(s) of a later join.
Choosing an Order for Joins
Introduction
• This section focuses on critical problem in costbased optimization:
– Selecting order for natural join of three or more
relations
• Compared to other binary operations, joins take
more time and therefore need effective
optimization techniques
Significance of Left and Right Join
Arguments
• The argument relations in joins determine the
cost of the join
• The left argument of the join is
– Called the build relation
– Assumed to be smaller
– Are stored in main memory.
Significance of Left and Right Join
Arguments
• The right argument of the join is
– Called the probe relation
– Read a block at a time
– Its tuples are matched with those of build relation
• The join algorithms which distinguish between the arguments are:
– One-pass join
– Nested-loop join
– Index join
Join Trees
• Order of arguments is important for joining
two relations
• Left argument, since stored in main-memory,
should be smaller
• With two relations only two choices of join
tree
• With more than two relations, there are n! ways to order the arguments
and therefore n! join trees, where n is the no. of relations
Join Trees
• Total # of tree shapes T(n) for n relations given
by recurrence:
•
•
•
•
T(1) = 1
T(2) = 1
T(3) = 2
T(4) = 5 … etc
Left-Deep Join Trees
• Consider 4 relations. Different ways to join
them are as follows
• In fig (a) all the right children are leaves. This
is a left-deep tree
• In fig (c) all the left children are leaves. This is
a right-deep tree
• Fig (b) is a bushy tree
• Considering left-deep trees is advantageous for deciding join orders
Join order
• Join order selection
– A1
A2
A3
..
– Left deep join trees
An
An
Ai
– Dynamic programming
• Best plan computed for each subset of relations
– Best plan (A1, .., An) = min cost plan of(
Best plan(A2, .., An)
A1
Best plan(A1, A3, .., An)
A2
….
Best plan(A1, .., An-1))
An
Dynamic Programming to Select a Join
Order and Grouping
• Three choices to pick an order for the join of many relations
are:
– Consider all of the relations
– Consider a subset
– Use a heuristic o pick one
• Dynamic programming is used either to consider all or a
subset
– Construct a table of costs based on relation size
– Remember only the minimum entry which will required to
proceed
Dynamic Programming to Select a Join
Order and Grouping
•
Can be used to consider all or a subset
•
Can be used for table construction of costs based on
relation size
•
Remember only the minimum entry which will
required to proceed
Dynamic Programming to Select a Join
Order and Grouping
Dynamic Programming to Select a Join
Order and Grouping
Dynamic Programming to Select a Join
Order and Grouping
Dynamic Programming to Select a Join
Order and Grouping
•
Disadvantage of dynamic programming is that it
does not involve the actual costs of the joins in
the calculations
•
Can be improved by adding the following
features
• Use disk’s I/O for evaluating cost
• When computing cost of R1 join R2, since
we sum cost of R1 and R2, we must also
compute estimates for there sizes
A Greedy Algorithm for Selecting a Join
Order
• It is expensive to use an exhaustive method
like dynamic programming
• Better approach is to use a join-order
heuristic for the query optimization
• Greedy algorithm can be an example of
– Make one decision at a time about order of join and never backtrack
on the decisions once made
QUERY PROCESSING
What is query processing?
A given SQL query is translated by the query
processor into a low level execution plan
• An execution plan is a program in a functional
language:
– The physical relational algebra, specific for each
DBMS.
• The physical relational algebra extends the
relational algebra with:
– Primitives to search through the internal data structures of the
DBMS
What is a Query Processor
 Group of components of a DBMS that converts
a user queries and data-modification
commands into a sequence of database
operations
 It also executes those operations
 Must supply detail regarding how the query is to be executed
292
Major parts of Query processor
Query Execution:
The algorithms
that manipulate
the data of the
database.
Focus on the operations
of extended relational
algebra.
293
Query Processing Steps
SQL Query
PARSER (parsing and semantic checking as in any compiler)
Parse tree (~ tree structure representing relational calculus expression)
OPTIMIZER (very advanced)
Execution plan (annotated relation algebra expression)
EXECUTOR (execution plan interpreter)
DBMS kernel
Data structures
Outline of Query Compilation
Query compilation
 Parsing : A parse tree for the
query is constructed
 Query Rewrite : The parse tree is
converted to an initial query plan
and transformed into logical
query plan (less time)
 Physical Plan Generation : Logical
Q Plan is converted into physical query plan
by selecting algorithms and order of
execution of these operator.
295
Basic Steps in Query Processing
Physical-Query-Plan Operators
 Physical operators are implementations of the operator of relational
algebra.
 They can also be use in non relational algebra
operators like “scan” which scans tables, that
is, bring each tuple of some relation into main
memory
297
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
ROUTERS
Basic Steps in Query Processing (Cont.)
– Parsing and translation
• translate the query into its internal form. This is
then translated into relational algebra.
• Parser checks syntax, verifies relations
– Evaluation
• The query-execution engine takes a queryevaluation plan, executes that plan, and returns the
answers to the query
ROUTERS
Scanning Tables
 One of the basic thing we can do in a Physical
query plan is to read the entire contents of a
relation R.
300
Scanning Tables
Basic approaches to locate the tuples of a
relation R
 Table
Scan
 Relation R is stored in secondary memory with
its tuples arranged in blocks
 It is possible to get the blocks one by one
 Index-Scan
 If there is an index on any attribute of Relation
R, we can use this index to get all the tuples of
301
Relation R
Sorting While Scanning Tables
 Number of reasons to sort a relation
 Query
could include an ORDER BY clause,
requiring that a relation be sorted.
 Algorithms to implement relational algebra
operations requires one or both arguments to
be sorted relations.
 Physical-query-plan operator sort-scan takes a
relation R, attributes on which the sort is to be
made, and produces R in that sorted order
302
Computation Model for Physical
Operator
 Physical-Plan Operator should be selected
wisely which is essential for good Query
Processor .
 For “cost” of each operator is estimated by
number of disk I/O’s for an operation.
 The total cost of operation depends on the
size of the answer, and includes the final write
back cost to the total cost of the query.
303
Parameters for Measuring Costs
 Parameters that affect the performance of a
query
 Buffer space availability in the main memory at
the time of execution of the query
 Size of input and the size of the output
generated

The performance is also affected by the size of memory block on the
disk and the size in the main memory.
304
Parameters for Measuring Costs
 B: The number of blocks are needed to hold all tuples of relation R.
 Also denoted as B(R)
 T:The number of tuples in relationR.
 Also denoted as T(R)
 V: The number of distinct values that appear in a column of a relation R
 V(R, a)- is the number of distinct values of column for a in relation R
305
I/O Cost for Scan Operators
 If relation R is clustered, then the number of
disk I/O for the table-scan operator is = ~B disk
I/O’s
 If relation R is not clustered, then the number of required disk I/O
generally is much higher
 A index on a relation R occupies many fewer
than B(R) blocks
That means a scan of the entire relation R
which takes at least B disk I/O’s will require
more I/O’s than the entire index
306
Iterators for Implementation of
Physical Operators
 Many physical operators can be implemented
as an Iterator.
 Three methods forming the iterator for an
operation are:
 1. Open( ) :
 This method starts the process of getting
tuples

It initializes any data structures needed to perform the operation
307
Iterators for Implementation of
Physical Operators
 2. GetNext( ):
 Returns the next
tuple in the result
 If there are no more tuples to return,
GetNext returns a special value NotFound
 3. Close( ) :
 Ends the iteration after all tuples
 It calls Close on any arguments of the
operator
308
Query Execution
One-Pass Algorithms for Database Operations (15.2)
309
Categorizing Algorithms

By general technique
• sorting-based
• hash-based
• index-based

By the number of times data is read from
disk
• one-pass
• two-pass
• multi-pass (more than 2)

By what the operators work on
• tuple-at-a-time, unary
• full-relation, unary
• full-relation, binary
One-Pass Algorithm Methods
 Tuple-at-a-time, unary operations: (selection & projection)
 Full-relation, unary operations
 Full-relation, binary operations (set & bag versions of
union)
311
One-Pass, Tuple-at-a-Time
These are for SELECT and PROJECT
 Algorithm:




read the blocks of R sequentially into an input buffer
perform the operation
move the selected/projected tuples to an output buffer
Requires only M ≥ 1
 I/O cost is that of a scan (either B or T, depending
on if R is clustered or not)

One-Pass Algorithms for Tuple-at-a-Time
Operations
 Tuple-at-a-time operations are selection and projection



read the blocks of R one at a time into an input buffer
perform the operation on each tuple
move the selected tuples or the projected tuples to the output
buffer
 The disk I/O requirement for this process depends only on
how the argument relation R is provided.

If R is initially on disk, then the cost is whatever it takes to
perform a table-scan or index-scan of R.
313
One-Pass, Tuple-at-a-Time
duplicate elimination (DELTA)
 Algorithm:

• keep a main memory search data structure D (use search
tree or hash table) to store one copy of each tuple
• read in each block of R one at a time (use scan)
• for each tuple check if it appears in D
• if not then add it to D and to the output buffer

Requires 1 buffer to hold current block of R;
remaining M-1 buffers must be able to hold D

I/O cost is just that of the scan
One-Pass, Unary, Full-Relation
duplicate elimination (DELTA)
 Algorithm:

• keep a main memory search data structure D (use search
tree or hash table) to store one copy of each tuple
• read in each block of R one at a time (use scan)
• for each tuple check if it appears in D
• if not then add it to D and to the output buffer


This requires a buffer to hold current block of R; remaining M-1 buffers
must be able to hold D
I/O cost is just that of the scan operation performed.
A selection or projection being performed
on a relation R
316
One-Pass Algorithms for Unary, fillRelation Operations
 Duplicate Elimination

To eliminate duplicates, we can read each block of R one at a
time, but for each tuple we need to make a decision as to
whether:
It is the first time we have seen this tuple, in which case we
copy it to the output, or
2.
We have seen the tuple before, in which case we must not
output this tuple.
One memory buffer holds one block of R's tuples, and the
remaining M - 1 buffers can be used to hold a single copy of every
tuple.
1.

317
Managing memory for a one-pass
duplicate-elimination
318
Duplicate Elimination
 When a new tuple from R is considered, we compare it
with all tuples seen so far


if it is not equal: we copy both to the output and add it to the
in-memory list of tuples we have seen.
if there are n tuples in main memory: each new tuple takes
processor time proportional to n, so the complete operation
takes processor time proportional to n2.
 We need a main-memory structure that allows each of the
operations:


i) Add a new tuple and
ii) Tell whether a given tuple is already present.
319
Duplicate Elimination (…contd.)
 The different structures that can be used for such main
memory structures are:


Hash table
Balanced binary search tree
320
One-Pass Algorithms for Unary, fillRelation Operations
 Grouping

The grouping operation gives us zero or more grouping
attributes and presumably one or more aggregated attributes

If we create in main memory one entry for each group then
we can scan the tuples of R, one block at a time.

The entry for a group consists of values for the grouping attributes
and an accumulated value or values for each aggregation.
321
Grouping
 The accumulated value is:




For MIN(a) or MAX(a) aggregate, record minimum
/maximum value, respectively.
For any COUNT aggregation, add 1 for each tuple of group.
For SUM(a), add value of attribute a to the accumulated sum
for its group.
AVG(a) is a hard case. We must maintain 2 accumulations:
count of no. of tuples in the group & sum of a-values of these
tuples. Each is computed as we would for a COUNT & SUM
aggregation, respectively. After all tuples of R are seen, take
quotient of sum & count to obtain average.
322
One-Pass Algorithms for Binary
Operations
 Binary operations include:





Union
Intersection
Difference
Product
Join
323
Set Union
 We read S into M - 1 buffers of main memory and build a
search structure where the search key is the entire tuple.
 All these tuples are also copied to the
output.
 Read each block of R into the Mth
buffer, one at a time.
 For each tuple t of R, see if t is in S, and if not, we copy t
to the output. If t is also in S, we skip t.
324
Set Intersection
 Read S into M - 1 buffers and build a search structure with
full tuples as the search key.
 Read each block of R, and for each tuple t of R, see if t is
also in S. If so, copy t to the output, and if not, ignore t.
325
Set Difference
 Read S into M - 1 buffers and build a search structure with
full tuples as the search key.
 To compute R -s S, read each block of R and examine each
tuple t on that block. If t is in S, then ignore t; if it is not in
S then copy t to the output.
 To compute S -s R, read the blocks of R and examine
each tuple t in turn. If t is in S, then delete t from the
copy of S in main memory, while if t is not in S do
nothing.
 After considering each tuple of R, copy to the output those tuples of S
that remain.
326
Bag Intersection
 Read S into M - 1 buffers.
 Multiple copies of a tuple t are not stored individually.
Rather store 1 copy of t & associate with it a count equal to
no. of times t occurs.
 Next, read each block of R, & for each tuple t of R see
whether t occurs in S. If not ignore t; it cannot appear in
the intersection. If t appears in S, & count associated with t
is (+)ve, then output t & decrement count by 1. If t appears
in S, but count has reached 0, then do not output t; we have
already produced as many copies of t in output as there
327
were copies in S.
Bag Difference
 To compute S -B R, read tuples of S into main memory &
count no. of occurrences of each distinct tuple.
 Then read R; check each tuple t to see whether t occurs in
S, and if so, decrement its associated count. At the end,
copy to output each tuple in main memory whose count is
positive, & no. of times we copy it equals that count.
 To compute R -B S, read tuples of S into main memory & count no. of
occurrences of distinct tuples.
328
Bag Difference (…contd.)
 Think of a tuple t with a count of c as c reasons not to copy
t to the output as we read tuples of R.
 Read a tuple t of R; check if t occurs in S. If not, then copy
t to the output. If t does occur in S, then we look at current
count c associated with t. If c = 0, then copy t to output. If
c > 0, do not copy t to output, but decrement c by 1.
329
Product
 Read S into M - 1 buffers of main memory
 Then read each block of R, and for each tuple t of R
concatenate t with each tuple of S in main memory.
 Output each concatenated tuple as it is formed.
 This algorithm may take a considerable amount of
processor time per tuple of R, because each such tuple
must be matched with M - 1 blocks full of tuples.
However, output size is also large, & time/output tuple is
small.
330
Natural Join
 Convention: R(X, Y) is being joined with S(Y, Z), where Y
represents all the attributes that R and S have in common,
X is all attributes of R that are not in the schema of S, & Z
is all attributes of S that are not in the schema of R.
Assume that S is the smaller relation.
 To compute the natural join, do the following:
1.
Read all tuples of S & form them into a main-memory
search structure.
Hash table or balanced tree are good e.g. of such structures. Use
M - 1 blocks of memory for this purpose.
331
Natural Join
1.
Read each block of R into 1 remaining main-memory
buffer.
For each tuple t of R, find tuples of S that agree with t
on all attributes of Y, using the search structure.
For each matching tuple of S, form a tuple by joining it with t, &
move resulting tuple to output.
332
QUERY EXECUTION
15.3
Nested-Loop Joins
Introduction to Nested-Loop Joins
 Used for relations of any side.
Not necessary that relation fits in main memory
 Uses “One-and-a-half” pass method in which
for each variation:
– One argument read just once.
– Other argument read repeatedly.
 Two kinds:
• Tuple-Based Nested Loop Join
• Block-Based Nested Loop Join
ADVANTAGES OF NESTED-LOOP JOIN
Fits in the iterator framework.
Allows us to avoid storing
intermediate relation on disk.
Tuple-Based Nested-Loop Join
 Simplest variation of the
nested-loop join
 Loop ranges over individual
tuples
Tuple-Based Nested-Loop Join
 Algorithm to compute the Join R(X,Y) | | S(Y,Z)
FOR each tuple s in S DO
FOR each tuple r in R DO
IF r and s join to make tuple t THEN
output t
R and S are two Relations with r and s as tuples.
 carelessness in buffering of blocks causes the use
of T(R)T(S) disk I/O’s
IMPROVEMENT & MODIFICATION
To decrease the cost
Method 1: Use algorithm for Index-Based joins
– We find tuple of R that matches given tuple of S
– We need not to read entire relation R
Method 2: Use algorithm for Block-Based joins
– Tuples of R & S are divided into blocks
– Uses enough memory to store blocks in order to
reduce the number of disk I/O’s.
An Iterator for Tuple-Based Nested-Loop
Join
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
Open0 C
R.Open()
S . Open ()
GetNextO {
REPEAT C
r := R.GetNext();
IF (r = NotFound) C /* R is exhausted for
the current s */
R.Close();
s := S.GetNext();
IF (s = NotFound) RETURN NotFound;
/* both R and S are exhausted */
R.Open0 ;
r := R.GetNext();
UNTIL(r and s join) ;
RETURN the join of r and s;
Close0 (
R. Close () ; S. Close () ;
Block-Based Nested-Loop Join Algorithm
Access to arguments is organized by block.
– While reading tuples of inner relation we
use less number of I/O’s disk.
Using enough space in main memory to store
tuples of relation of the outer loop.
– Allows to join each tuple of the inner
relation with as many tuples as possible.
•
•
•
•
•
•
•
•
•
•
•
•
•
FOR each chunk of M-1 blocks of S DO BEGIN
read these blocks into main-memory buffers;
organize their tuples into a search structure whose
search key is the common attributes of R and S;
FOR each block b of R DO BEGIN
read b into main memory;
FOR each tuple t of b DO BEGIN
find the tuples of S in main memory that
join with t ;
output the join of t with each of these tuples;
END ;
END ;
END ;
Block-Based Nested-Loop Join Algorithm
ALGORITHM:
FOR each chunk of M-1 blocks of S DO
FOR each block b of R DO
FOR each tuple t of b DO
find the tuples of S in memory that join with t
output the join of t with each of these tuples
Block-Based Nested-Loop Join Algorithm
• Assumptions:
– B(S) ≤ B(R)
– B(S) > M
– This means that the neither relation fits in the
entire main memory.
Analysis of Nested-Loop Join
– Number of disk I/O’s:
[B(S)/(M-1)]*(M-1 +B(R))
or
B(S) + [B(S)B(R)/(M-1)]
or approximately B(S)*B(R)/M
Two-Pass Algorithms
Based on Sorting
SECTION 15.4
Two-Pass Algorithms Based on
Sorting
• Two-pass Algorithms: data from operand relations
is read into main memory, then processed, written
out to disk and then re-read from the disk to
complete the operation. This is the basic principle.
Basic idea
• Step 1: Read M blocks of R into main memory.
• Step 2:Sort these M blocks in main memory,
using an efficient, main-memory sorting
algorithm. so we expect that the time to sort
will not exceed the disk 1/0 time for step (1).
• Step 3: Write the sorted list into M blocks of
disk.
Duplicate Elimination Using Sorting
δ(R)
• To perform δ(R) operation in two passes, we sort
tuples of R in sublists. Then we use available
memory to hold one block from each stored
sublists and then repeatedly copy one to the
output and ignore all tuples identical to it.
• The total cost of this algorithm is 3B(R)
• This algorithm requires only √B(R)blocks of main
memory, rather than B(R) blocks(one-pass
algorithm).
Example
• Suppose that tuples are integers, and only two
tuples fit on a block. Also, M = 3 and the
relation R consists of 17 tuples:
2,5,2,1,2,2,4,5,4,3,4,2,1,5,2,1,3
• After first-pass
Sublists
Elements
R1
1,2,2,2,2,5
R2
2,3,4,4,4,5
R3
1,1,2,3,5
Example
• Second pass
Sublist
In memory
Waiting on disk
R1
1,2
2,2, 2,5
R2
2,3
4,4, 4,5
R3
1,1
2,3,5
Sublist
In memory
Waiting on disk
R1
2
2,2, 2,5
R2
2,3
4,4, 4,5
After processing tuple 1
Output: 1
R3
2,3
Continue the same process with next tuple.
5
Grouping and Aggregation Using
Sorting γ(R)
• Two-pass algorithm for grouping and aggregation is quite
similar to the previous algorithm.
• Step 1:Read the tuples of R into memory, M blocks at a
time. Sort each M blocks, using the grouping attributes of L
as the sort key. Write each sorted sublist to disk.
• Step 2:Use one main-memory buffer for each sublist, and
initially load the first block of each sublist into its buffer.
• Step 3:Repeatedly find the least value of the sort key
(grouping attributes) present among the first available
tuples in the buffers.
• This algorithm takes 3B(R) disk 1/0's, and will work as long
as B(R) < M².
A Sort-Based Union Algorithm
• For bag-union one-pass algorithm is used.
• For set-union
– Step 1:Repeatedly bring M blocks of R into main memory, sort their
tuples, and write the resulting sorted sublist back to disk.
– Step 2:Do the same for S, to create sorted sublists for relation S.
– Step 3:Use one main-memory buffer for each sublist of R and S.
Initialize each with the first block from the corresponding sublist.
– Step 4:Repeatedly find the first remaining tuple t among all the
buffers. Copy t to the output. and remove from the buffers all copies
of t (if R and S are sets there should be at most two copies)
• This algorithm takes 3(B(R)+B(S)) disk 1/0's, and will work as long as
B(R)+B(S) < M².
Sort-Based Intersection and
Difference
• For both set version and bag version, the algorithm is
same as that of set-union except that the way we
handle the copies of a tuple t at the fronts of the
sorted sublists.
• For set intersection, output t if it appears in both R and
S.
• For bag intersection, output t the minimum of the
number of times it appears in R and in S.
• For set difference, R-S, output t if and only if it appears
in R but not in S.
• For bag difference, R-S, output t the number of times it
appears in R minus the number of times it appears in S.
A Simple Sort-Based Join Algorithm
Given relation R(x,y) and S(y,z) to join, and given M blocks of main memory for
buffers,
1. Sort R, using a two phase, multiway merge sort, with y as the sort key.
2. Sort S similarly
3. Merge the sorted R and S. Generally we use only two buffers, one for the current
block of R and the other for current block of S. The following steps are done
repeatedly.
a. Find least value y of the join attributes Y that is currently at the front of the
blocks for R and S.
b. If y doesn’t appear at the front of the other relation, then remove the tuples
with sort key y.
c. Otherwise identify all the tuples from both relation having sort key y
d. Output all tuples that can be formed by joining tuples from R and S with a
common Y value y.
e. If either relation has no more unconsidered tuples in main memory reload
buffer for the relation.
A More Efficient Sort-Based Join
• If we do not have to worry about very large numbers of
tuples with a common value for the join attribute(s), then
we can save two disk 1/0's per block by combining the
second phase of the sorts with the join itself
• To compute R(X, Y) ►◄ S(Y, Z) using M main-memory
buffers
– Create sorted sublists of size M, using Y as the sort key, for both
R and S.
– Bring the first block of each sublist into a buffer
– Repeatedly find the least Y-value y among the first available
tuples of all the sublists. Identify all the tuples of both relations
that have Y-value y. Output the join of all tuples from R with all
tuples from S that share this common Y-value
A More Efficient Sort-Based Join
• The number of disk I/O’s is 3(B(R) + B(S))
• It requires B(R) + B(S) ≤ M² to work
Summary of Sort-Based Algorithms
Operators
Approximate
M required
Disk I/O
γ,δ
√B
3B
U,∩,−
√(B(R) + B(S))
3(B(R) + B(S))
►◄
√(max(B(R),B(S)))
5(B(R) + B(S))
►◄(more efficient)
√(B(R) + B(S))
3(B(R) + B(S))
Query Execution
15.5 Two-pass Algorithms based on Hashing
At a glimpse
•
•
•
•
•
•
•
•
Introduction
Partitioning Relations by Hashing
Algorithm for Duplicate Elimination
Grouping and Aggregation
Union, Intersection, and Difference
Hash-Join Algorithm
Sort based Vs Hash based
Summary
Introduction
Hashing is done if the data is too big to store in
main memory buffers.
– Hash all the tuples of the argument(s) using an
appropriate hash key.
– For all the common operations, there is a way to
select the hash key so all the tuples that need to be
considered together when we perform the
operation have the same hash value.
– This reduces the size of the operand(s) by a factor
equal to the number of buckets.
Partitioning Relations by Hashing
Algorithm:
initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO BEGIN
read block b into the Mth buffer;
FOR each tuple t in b DO BEGIN
IF the buffer for bucket h(t) has no room for t THEN
BEGIN
copy the buffer t o disk;
initialize a new empty block in that buffer;
END;
copy t to the buffer for bucket h(t);
END ;
END ;
FOR each bucket DO
IF the buffer for this bucket is not empty THEN
write the buffer to disk;
Duplicate Elimination
• For the operation δ(R) hash R to M-1 Buckets.
(Note that two copies of the same tuple t will hash to the same
bucket)
• Do duplicate elimination on each bucket Ri independently,
using one-pass algorithm
• The result is the union of δ(Ri), where Ri is the portion of R
that hashes to the ith bucket
Requirements
• Number of disk I/O's: 3*B(R)
– B(R) < M(M-1), only then the two-pass, hash-based
algorithm will work
• In order for this to work, we need:
– hash function h evenly distributes the tuples
among the buckets
– each bucket Ri fits in main memory (to allow the
one-pass algorithm)
– i.e., B(R) ≤ M2
Grouping and Aggregation
 Hash all the tuples of relation R to M-1 buckets, using a hash
function that depends only on the grouping attributes
(Note: all tuples in the same group end up in the same bucket)
 Use the one-pass algorithm to process each bucket
independently
 Uses 3*B(R) disk I/O's, requires B(R) ≤ M2
Union, Intersection, and Difference
• For binary operation we use the same hash
function to hash tuples of both arguments.
• R U S we hash both R and S to M-1
• R ∩ S we hash both R and S to 2(M-1)
• R-S we hash both R and S to 2(M-1)
• Requires 3(B(R)+B(S)) disk I/O’s.
• Two pass hash based algorithm requires
min(B(R)+B(S))≤ M2
Hash-Join Algorithm
• Use same hash function for both relations; hash function
should depend only on the join attributes
•
•
•
•
Hash R to M-1 buckets R1, R2, …, RM-1
Hash S to M-1 buckets S1, S2, …, SM-1
Do one-pass join of Ri and Si, for all i
3*(B(R) + B(S)) disk I/O's; min(B(R),B(S)) ≤ M2
Sort based Vs Hash based
• For binary operations, hash-based only limits size
to min of arguments, not sum
• Sort-based can produce output in sorted order,
which can be helpful
• Hash-based depends on buckets being of equal
size
• Sort-based algorithms can experience reduced
rotational latency or seek time
Index-Based Algorithms
Chapter 15
Section 15.6
366
Clustering and Nonclustering
Indexes
 A relation is “clustered” if its tuples are packed into roughly as
few blocks as can possibly hold those tuples.
 Clustering Indexes, which are indexes on an attribute or
attributes such that all the tuples with a fixed value for the
search key of this index appear on roughly as few blocks as can
hold them. Note that a relation that isn't clustered cannot
have a clustering index, but even a clustered relation can have
nonclustering indexes.
 A clustering index has all tuples with a fixed value packed into
the minimum possible number of blocks
367
Clustering and Nonclustering Indexes
• A relation that isn’t clustered cannot have a
clustering index
• A clustered relation can have nonclustering
indexes
368
Index-Based Selection
• Selection on equality: a=v(R)
• Clustered index on a: cost B(R)/V(R,a)
– If the index on R.a is clustering, then the number of disk
I/O's to retrieve the set a=v (R) will average B(R)/V(R, a).
The actual number may be somewhat higher.
369
Index-Based Selection
• The actual number may be higher:
1. index is not kept entirely in main
memory
2. they spread over more blocks
3. May not be tightly packed into the
blocks.
370
Example
• B(R)=1000, T(R)=20,000 number of I/O’s
required:
• 1. clustered, not index
1000
• 2. not clustered, not index
20,000
• 3. If V(R,a)=100, index is clustering
10
• 4. If V(R,a)=10, index is the nonclustering 2,000
371
Joining by Using an Index
• Natural join R(X, Y) S S(Y, Z)
Number of I/O’s to get R
Clustered: B(R)
Not clustered: T(R)
Number of I/O’s to get tuple t of S
Clustered: T(R)B(S)/V(S,Y)
Not clustered: T(R)T(S)/V(S,Y)
372
Example
• R(X,Y): 1000 blocks S(Y,Z)=500 blocks
Assume 10 tuples in each block,
so T(R)=10,000 and T(S)=5000
V(S,Y)=100
If R is clustered, and there is a clustering index on
Y for S
the number of I/O’s for R is:
1000
the number of I/O’s for S
is10,000*500/100=50,000
373
Joins Using a Sorted Index
• Natural join R(X, Y) S (Y, Z) with index on Y
for either R or S
• Extreme case: Zig-zag join
• Example:
relation R(X,Y) and R(Y,Z) with index on Y for
both relations
search keys (Y-value) for R: 1,3,4,4,5,6
search keys (Y-value) for S: 2,2,4,6,7,8
374
Chapter 15.7
Buffer Management
What does a buffer manager do?
With the help of buffers, the Central Task of making
memory buffers available to processors is
accomplished.
In practice:
1) rarely allocated in advance
2) the value of M may vary depending on system
conditions
Therefore, buffer manager is used to allow processes to
get the memory they need, while minimizing the delay
and unclassifiable requests.
The role of the buffer manager
Read/Writes
Requests
Buffers
Buffer
manager
Figure 1: The role of the buffer manager : responds to requests for
main-memory access to disk blocks
15.7.1 Buffer Management Architecture
Two broad architectures for a buffer manager:
1) Relational DBMS is the buffer manager which controls
main memory directly
2) The buffer manager allocates buffers in virtual memory,
allowing the OS to decide how to use buffers.
• “object-oriented” DBMS
Buffer Pool
Key setting for the Buffer manager to be efficient:
Problem:
The buffer manager should limit the number of buffers in
use so that they fit in the available main memory, i.e.
Don’t exceed available space.
The number of buffers is a parameter set when the DBMS is
initialized.
No matter which architecture of buffering is used, we simply
assume that there is a fixed-size buffer pool, a set of
buffers available to queries and other database actions.
Buffer Pool
Page Requests from Higher Levels
BUFFER POOL
disk page
free frame
MAIN MEMORY
DISK
•
•
DB
choice of frame dictated
by replacement policy
Data must be in RAM for DBMS to operate on it!
Buffer Manager hides the fact that not all data is in RAM.
15.7.2 Buffer Management Strategies
Buffer-replacement strategies:
Critical choice is that the buffer manager has to
make is when a buffer is needed for a newly
requested block and the buffer pool is full then
which block to throw out the buffer pool.
Buffer-replacement strategy -- LRU
Least-Recently Used (LRU):
To throw out the block that has not been read or written
for the longest time.
• Requires more maintenance but it is effective.
• Update the time table for every access.
• Least-Recently Used blocks are usually less likely to
be accessed sooner than other blocks.
Buffer-replacement strategy -- FIFO
First-In-First-Out (FIFO):
The buffer that has been occupied the longest by the
same block is emptied and used for the new block.
• Requires less maintenance but it can make more
mistakes.
• Keep only the loading time
• The oldest block doesn’t mean it is less likely to be
accessed.
Example: the root block of a B-tree index
Buffer-replacement strategy – “Clock”
The “Clock” Algorithm (“Second Chance”)
Think of the 8 buffers as arranged in a circle, shown as
Figure 3
Flag 0 and 1:
buffers with a 0 flag are ok to sent their contents back to
disk, i.e. ok to be replaced
buffers with a 1 flag are not ok to be replaced
Buffer-replacement strategy – “Clock”
0
0
1
0
the buffer with
a 0 flag will
be replaced
0
0
1
1
Start point to
search a 0 flag
The flag will
be set to 0
By next time the hand
reaches it, if the content of
this buffer is not accessed,
i.e. flag=0, this buffer will
be replaced.
That’s “Second Chance”.
Figure 3: the clock algorithm
Buffer-replacement strategy -- Clock
a buffer’s flag set to 1 when:
a block is read into a buffer
the contents of the buffer is accessed
a buffer’s flag set to 0 when:
the buffer manager needs a buffer for a new block, it
looks for the first 0 it can find, rotating clockwise. If it passes
1’s, it sets them to 0.
System Control helps Buffer-replacement strategy
System Control
The query processor or other components of a DBMS can give
advice to the buffer manager in order to avoid some of the
mistakes that would occur with a strict policy such as LRU,FIFO
or Clock.
For example:
A “pinned” block means it can’t be moved to disk without first
modifying certain other blocks that point to it.
In FIFO, use “pinned” to force root of a B-tree to remain in
memory at all times.
15.7.3 The Relationship Between Physical
Operator Selection and Buffer Management
Problem:
Physical Operator expected certain number of
buffers M for execution.
However, the buffer manager may not be able to
guarantee these M buffers are available.
15.7.3 The Relationship Between Physical
Operator Selection and Buffer Management
Questions:
Can the algorithm adapt to changes of M, the
number of main-memory buffers available?
When available buffers are less than M, and some
blocks have to be put in disk instead of in memory.
How the buffer-replacement strategy impact the
performance (i.e. the number of additional I/O’s)?
Example
FOR each chunk of M-1 blocks of S DO BEGIN
read these blocks into main-memory buffers;
organize their tuples into a search structure whose
search key is the common attributes of R and S;
FOR each block b of R DO BEGIN
read b into main memory;
FOR each tuple t of b DO BEGIN
find the tuples of S in main memory that
join with t ;
output the join of t with each of these tuples;
END ;
END ;
END ;
Figure 15.8: The nested-loop join algorithm
Example
The outer loop number (M-1) depends on the average
number of buffers are available at each iteration.
The outer loop use M-1 buffers and 1 is reserved for a block
of R, the relation of the inner loop.
If we pin the M-1 blocks we use for S on one iteration of the
outer loop, we shall not lose their buffers during the round.
Also, more buffers may become available and then we could
keep more than one block of R in memory.
Will these extra buffers improve the running time?
Example
CASE1: NO
Buffer-replacement strategy: LRU
Buffers for R: k
We read each block of R in order into buffers.
By end of the iteration of the outer loop, the last k blocks of R
are in buffers.
However, next iteration will start from the beginning of R
again.
Therefore, the k buffers for R will need to be replaced.
Example
CASE 2: YES
Buffer-replacement strategy: LRU
Buffers for R: k
We read the blocks of R in an order that alternates:
firstlast and then lastfirst.
In this way, we save k disk I/Os on each iteration of the outer
loop except the first iteration.
Other Algorithms and M buffers
Other Algorithms also are impact by M and the
buffer-replacement strategy.
Sort-based algorithm
If we use a sort-based algorithm for some operator, then it is
possible to adapt to changes in M.
If Af shrinks, we can change the size of a sublist,
since the sort-based algorithms we discussed do not depend on the sublists
being the same size. The major limitation is that as M shrinks,
we could be forced to create so many sublists that we cannot then
allocate a buffer for each sublist in the merging process..
Reason that we use more than two passes:
Two passes are usually enough, however, for
the largest relation, we use as many passes as
necessary.
Chapter 15 Query Execution
Algorithms using more than two passes
•
•
Multi-pass Sort-based Algorithms
Suppose we have M main-memory buffers available
to sort a relation R, which we assume is stored
clustered.
•
Then we do the following:
If R fits in M blocks (i.e., B(R)<=M)
1. Read R into main memory.
2. Sort it using any main-memory sorting algorithm.
3. Write the sorted relation to disk.
INDUCTION:
If R does not fit into main memory.
1. Partition the blocks holding R into
M
groups, which we shall call R1, R2, R3…
2. Recursively sort Ri for each i=1,2,3…M.
3. Merge the M sorted sublists.
If we are not merely sorting R, but performing
a unary operation such as δ or γ on R.
We can modify the above so that at the final
merge we perform the operation on the
tuples at the front of the sorted sublists.
That is:
• For a δ, output one copy of each distinct tuple,
and skip over copies of the tuple.
• For a γ, sort on the grouping attributes only,
and combine the tuples with a given value of
these grouping attributes.
Performance of Multipass, Sort-Based
Algorithms
•
•
•
•
•
BASIS: If k = 1, i.e., one pass is allowed, then we must have B(R) < M. Put
another way, s(M, 1) = Af.
INDUCTION: Suppose k > 1. Then we partition R into 1M pieces, each of
which must be sortable in k - 1 passes. If B(R) = s(M, k), then s(M, k)/:l17
which is the size of each of the M pieces of R, cannot exceed s(M, k - 1).
That
• is: s(M, k) = Ms(M, k - 1)
Multipass Hash-Based Algorithms
•
BASIS: The basis is that, for a unary operation, if the relation fits in hl buffers, read
it into memory and perform111 the operation.
• For a binary operation, if either relation fits in ,11 - I buffers, perform the
operation by reading this relation into main
memory and then read the second relation, one block at a time, into the Mth
buffer.
•
•
INDUCTION: If no relation fits in main memory, then hash each relation into A 1 -1
buckets, as discussed in Section 15.5.1. Recursively perform the operation on each
bucket or corresponding pair of buckets, and accumulate the output
from each bucket or pair.
Chapter 15 Query Execution
15.8 Algorithms using more than two
passes
Reason that we use more than two passes:
Two passes are usually enough, however, for
the largest relation, we use as many passes as
necessary.
•
•
Multi-pass Sort-based Algorithms
Suppose we have M main-memory buffers
available to sort a relation R, which we assume is
stored clustered.
•
Then we do the following:
If R fits into M blocks (i.e., B(R)<=M)
1. Read R into main memory.
2. Sort it using any main-memory sorting
algorithm.
3. Write the sorted relation to disk.
INDUCTION:
If R does not fit into main memory.
1. Partition the blocks holding R into
M
groups, which we shall call R1, R2, R3…
2. Recursively sort Ri for each i=1,2,3…M.
3. Merge the M sorted sublists.
If we are not merely sorting R, but performing
a unary operation such as δ or γ on R.
We can modify the above statement so that at the final merge we perform
the operation on the tuples at the front of the sorted sublists.
That is:
• For a δ, output one copy of each distinct tuple, and
skip over copies of the tuple.
• For a γ, sort on the grouping attributes only, and
combine the tuples with a given value of these
grouping attributes.
Conclusion
The two pass algorithms based on sorting or
hashing have natural recursive analogs that
take three or more passes and will work for
larger amounts of data.
Performance of Multipass, Sort-Based
Algorithms
•
•
•
•
•
BASIS: If k = 1, i.e., if one pass is allowed, then we must have B(R) < M. Put
another way, s(M, 1) = Af.
INDUCTION : Suppose k > 1. Then we partition R into 1M pieces, each of
which must be sortable in k - 1 passes. If B(R) = s(M, k), then s(M, k)/:l17
which is the size of each of the M pieces of R, cannot exceed s(M, k - 1).
That
• is: s(M, k) = Ms(M, k - 1)