
SECONDARY STORAGE
MANAGEMENT
SECTIONS 13.1 – 13.2
Sanuja Dabade & Eilbroun Benjamin
CS 257 – Dr. TY Lin
Presentation Outline

13.1 The Memory Hierarchy
13.1.1 The Memory Hierarchy
 13.1.2 Transfer of Data Between Levels
 13.1.3 Volatile and Nonvolatile Storage
 13.1.4 Virtual Memory


13.2 Disks
13.2.1 Mechanics of Disks
 13.2.2 The Disk Controller
 13.2.3 Disk Access Characteristics

Presentation Outline (con’t)

13.3 Accelerating Access to Secondary
Storage
13.3.1 The I/O Model of Computation
 13.3.2 Organizing Data by Cylinders
 13.3.3 Using Multiple Disks
 13.3.4 Mirroring Disks
 13.3.5 Disk Scheduling and the Elevator
Algorithm
 13.3.6 Prefetching and Large-Scale
Buffering

13.1.1 Memory Hierarchy



Different components for data storage have different data capacities.
Cost per byte to store data also varies.
The smallest-capacity devices offer the fastest access speed, at the highest cost per bit.
Memory Hierarchy Diagram
[Diagram: the hierarchy from bottom to top: Cache, Main Memory, Disk, Tertiary Storage. Programs and main-memory DBMSs operate on main memory; a DBMS reaches the disk through the file system or by treating it as virtual memory.]
13.1.1 Memory Hierarchy

Cache
 Lowest level of the hierarchy.
 Data items in the cache are copies of certain locations of main memory.
 Sometimes, values in the cache are changed and the corresponding changes to main memory are delayed.
 The machine looks in the cache for instructions as well as for the data those instructions use.
 The cache holds a limited amount of data, so it is usually necessary to move something out of the cache in order to accommodate new data.
13.1.1 Memory Hierarchy (con’t)


In a single-processor computer there is no need to update the data in main memory immediately.
In multiprocessor systems, data is updated in main memory immediately; this is called write-through.
Main Memory


Everything that happens in the computer (instruction execution, data manipulation) works on information that is resident in main memory.
Main memories are random access: one can obtain any byte in the same amount of time.
Secondary storage
Secondary storage is a form of storage that is both significantly slower and significantly more capacious than main memory.
 Used to store data and programs when they are not being processed.
 More permanent than main memory, as data and programs are retained when the power is turned off.
 E.g. magnetic disks, hard disks.
Tertiary Storage
Tertiary storage devices have been developed to hold data volumes measured in terabytes. The principal kinds of terabyte storage devices:
 Ad-hoc Tape Storage
 Optical-Disk Juke Boxes
 Tape Silos
13.1.2 Transfer of Data Between Levels




Data moves between adjacent levels of the hierarchy.
At the secondary or tertiary levels, accessing the desired data or finding the desired place to store the data takes a lot of time.
A disk is organized into blocks.
Entire blocks are moved to and from a region of main memory called a buffer.
13.1.2 Transfer of Data Between Levels (cont'd)


A key technique for speeding up database operations is to arrange the data so that when one piece of a data block is needed, it is likely that other data on the same block will be needed at the same time.
The same idea applies to the other levels of the hierarchy.
13.1.3 Volatile and Non Volatile
Storage



A volatile device forgets the data stored on it when the power goes off.
A nonvolatile device holds data even when the device is turned off.
All secondary and tertiary storage devices are nonvolatile; main memory is volatile.
13.1.4 Virtual Memory



Typical software executes in virtual memory.
The address space is typically 32 bits, i.e., 2^32 bytes or 4GB.
Transfer between virtual memory and disk is in terms of blocks.
13.2.1 Mechanics of Disks

Mechanics of Disks
 Use of secondary storage is one of the important characteristics of a DBMS.
 A disk consists of 2 moving pieces:
 1. the disk assembly
 2. the head assembly
 The disk assembly consists of 1 or more platters.
 Platters rotate around a central spindle.
 Bits are stored on the upper and lower surfaces of the platters.
13.2.1 Mechanics of Disks (con't)

A disk surface is organized into tracks.
The tracks that are at a fixed radius from the center form one cylinder.
Tracks are organized into sectors.
Sectors are segments of the circle, separated by gaps.
13.2.2 Disk Controller


One or more disks are controlled by a disk controller.
Disk controllers are capable of:
 Controlling the mechanical actuator that moves the head assembly.
 Selecting a sector from among all those in the cylinder at which the heads are positioned.
 Transferring bits between the desired sector and main memory.
 Possibly buffering an entire track.
13.2.3 Disk Access Characteristics

Accessing (reading/writing) a block requires 3 steps:
 The disk controller positions the head assembly at the cylinder containing the track on which the block is located. This is the 'seek time'.
 The disk controller waits while the first sector of the block moves under the head. This is the 'rotational latency'.
 All the sectors and the gaps between them pass under the head while the disk controller reads or writes data in these sectors. This is the 'transfer time'.
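
As a minimal sketch of how these three components combine, the function below adds seek time, rotational latency, and transfer time. The default figures are illustrative assumptions (the 6.46 ms seek and 8.33 ms rotation echo numbers used later in these slides; 256 sectors per track is made up), not the Megatron 747's actual specification:

def disk_access_time_ms(seek_ms, rpm, sectors_per_track, sectors_to_read):
    """Estimate block access time = seek + rotational latency + transfer.

    Average rotational latency is half a rotation; transfer time is the
    fraction of a rotation occupied by the sectors being read."""
    rotation_ms = 60_000.0 / rpm              # time for one full rotation
    rotational_latency = rotation_ms / 2      # wait for first sector, on average
    transfer = rotation_ms * sectors_to_read / sectors_per_track
    return seek_ms + rotational_latency + transfer

# Assumed numbers: 6.46 ms average seek, 7200 RPM (8.33 ms/rotation),
# 256 sectors per track, one 16-sector block.
print(round(disk_access_time_ms(6.46, 7200, 256, 16), 2), "ms")
# -> roughly 11 ms, close to the M747 figure quoted later for a 16K block.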
SECONDARY STORAGE
MANAGEMENT
SECTION 13.3
Eilbroun Benjamin
CS 257 – Dr. TY Lin
Presentation Outline

13.3 Accelerating Access to Secondary
Storage
13.3.1 The I/O Model of Computation
 13.3.2 Organizing Data by Cylinders
 13.3.3 Using Multiple Disks
 13.3.4 Mirroring Disks
 13.3.5 Disk Scheduling and the Elevator
Algorithm
 13.3.6 Prefetching and Large-Scale
Buffering

13.3 Accelerating Access to Secondary
Storage

Several approaches for more efficiently accessing data in secondary storage:
 Place blocks that are accessed together on the same cylinder.
 Divide the data among multiple disks.
 Mirror disks.
 Use disk-scheduling algorithms.
 Prefetch blocks into main memory.

Scheduling latency: the added delay in accessing data caused by a disk-scheduling algorithm.
Throughput: the number of disk accesses per second that the system can accommodate.
13.3.1 The I/O Model of Computation

The number of block accesses (Disk I/O’s) is a good
time approximation for the algorithm.


This should be minimized.
Ex 13.3: You want to have an index on R to identify
the block on which the desired tuple appears, but
not where on the block it resides.
For Megatron 747 (M747) example, it takes 11ms to
read a 16k block.
 A standard microprocessor can execute millions of
instruction in 11ms, making any delay in searching for
the desired tuple negligible.

13.3.2 Organizing Data by Cylinders


If we read all blocks on a single track or cylinder
consecutively, then we can neglect all but first seek
time and first rotational latency.
Ex 13.4: We request 1024 blocks of M747.
If data is randomly distributed, average latency is
10.76ms by Ex 13.2, making total latency 11s.
 If all blocks are consecutively stored on 1 cylinder:

 6.46ms
+ 8.33ms * 16 = 139ms
(1 average seek)
(time per rotation)
(# rotations)
13.3.3 Using Multiple Disks


If we have n disks, read/write performance will increase by a factor of n.
Striping: distributing a relation across multiple disks following this pattern:
 Data on disk 1: R1, R1+n, R1+2n, ...
 Data on disk 2: R2, R2+n, R2+2n, ...
 ...
 Data on disk n: Rn, Rn+n, Rn+2n, ...
Ex 13.5: We request 1024 blocks with n = 4.
 6.46ms (1 average seek) + 8.33ms (time per rotation) * (16/4) (# of rotations) = 39.8ms
13.3.4 Mirroring Disks




Mirroring disks: having 2 or more disks hold identical copies of data.
Benefit 1: If n disks are mirrors of each other, the system can survive the crash of up to n-1 disks.
Benefit 2: If we have n disks, read performance increases by a factor of n.
Performance increases further by having the controller select, for each read, the disk whose head is closest to the desired data block.
13.3.5 Disk Scheduling and the
Elevator Problem


The disk controller runs this algorithm to select which of several pending requests to process next.
Pseudocode:

requests[]                        // list of all unprocessed data requests
upon receiving a new data request:
    requests[].add(new request)
while (requests[] is not empty):
    move head to next location
    if (head location matches a request in requests[]):
        retrieve data
        remove request from requests[]
    if (head reaches end):
        reverse head direction
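
A runnable sketch of the same strategy, under the simplifying assumption that each request names a cylinder number (the class layout is our illustration, not the controller's actual data structure):

import bisect

class ElevatorScheduler:
    """Minimal elevator (SCAN) scheduler over cylinder numbers.

    Requests may arrive at any time via add(); next_request() returns the
    next cylinder in the current sweep direction, reversing at the ends."""
    def __init__(self, head=0):
        self.head = head
        self.direction = +1        # +1 sweeping up, -1 sweeping down
        self.pending = []          # sorted list of requested cylinders

    def add(self, cylinder):
        bisect.insort(self.pending, cylinder)

    def next_request(self):
        if not self.pending:
            return None
        if self.direction == +1:
            i = bisect.bisect_left(self.pending, self.head)
            if i == len(self.pending):    # nothing above: reverse
                self.direction = -1
                return self.next_request()
        else:
            i = bisect.bisect_right(self.pending, self.head) - 1
            if i < 0:                     # nothing below: reverse
                self.direction = +1
                return self.next_request()
        self.head = self.pending.pop(i)
        return self.head

sched = ElevatorScheduler(head=0)
for c in (8000, 24000, 56000):
    sched.add(c)
print(sched.next_request(), sched.next_request())   # 8000 24000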
13.3.5 Disk Scheduling and the
Elevator Problem (con’t)
Events:
Head starting point
Request data at 8000
Request data at 24000
Request data at 56000
Get data at 8000
Request data at 16000
Get data at 24000
Request data at 64000
Get data at 56000
Request data at 40000
Get data at 64000
Get data at 40000
Get data at 16000
[Timeline chart omitted: head position over cylinders 8000-64000 versus time, with requests served at t = 4.3, 13.6, 26.9, 34.2, 45.5, and 56.8.]

data      time
8000      4.3
24000     13.6
56000     26.9
64000     34.2
40000     45.5
16000     56.8
13.3.5 Disk Scheduling and the
Elevator Problem (con’t)
Elevator Algorithm          FIFO Algorithm
data     time               data     time
8000     4.3                8000     4.3
24000    13.6               24000    13.6
56000    26.9               56000    26.9
64000    34.2               16000    42.2
40000    45.5               64000    59.5
16000    56.8               40000    70.8
13.3.6 Prefetching and Large-Scale
Buffering
If, at the application level, we can predict the order in which blocks will be requested, we can load them into main memory before they are needed.
Advantage: speeds up access when the needed blocks are known but the timing of requests is data-dependent.
Disadvantage: requires extra main-memory buffers. No help when accesses are random.
Disk Failure Ways and Their Mitigation
Priya Gangaraju (Class ID: 203)
Ways in which disks can fail
 Intermittent failure.
 Media Decay.
 Write failure.
 Disk Crash.
Intermittent Failures
 A read or write operation on a sector succeeds only after repeated tries.
 Intermittent failure is the most common form of disk failure.
 Parity checks can be used to detect this kind of failure.
Media Decay
 This is a serious form of failure.
 One or more bits are permanently corrupted and cannot be recovered.
 It becomes impossible to read a sector correctly even after many tries.
 The stable storage technique for organizing a disk is used to guard against this failure.
Write Failure
 A write failure occurs when an attempt to write a sector is not possible, and the attempt to retrieve the previously written sector is unsuccessful.
 Possible reason: a power outage while the sector is being written.
 The stable storage technique can be used to guard against this.
Disk Crash
• Most serious form of disk failure.
• Entire disk becomes unreadable, suddenly and
permanently.
• RAID techniques can be used for coping with disk
crashes.
More on Intermittent Failures...
 When we try to read a sector, the correct content of that sector may not be delivered to the disk controller.
 If the controller has a way to tell whether the sector read is good or bad (checksums), it can reissue the read request when bad data is read.
More on Intermittent Failures...
 The controller can attempt to write a sector, but the contents of the sector may not end up as what was intended.
 The only way to check this is to let the disk go around again and read the sector.
 One way to perform the check is to read the sector back and compare it with the sector we intended to write.

Contd...
 Instead of performing the complete comparison at the disk controller, a simpler way is to read the sector back and check whether a good sector was read.
 If the sector is good, then the write was correct; otherwise the write was unsuccessful and must be repeated.
Checksums.
 Technique used to determine the good/bad status of a
sector.
 Each sector has some additional bits called the
checksum that are set depending on the values of the
data bits in that sector.
 If checksum is not proper on reading, then there is an
error in reading.
Checksums(contd..)
 There is a small chance that the block was not read
correctly even if the checksum is proper.
 The probability of correctness can be increased by
using many checksum bits.
Checksum Calculation
 The checksum is based on the parity of all bits in the sector.
 If there is an odd number of 1's among a collection of bits, the bits are said to have odd parity; a parity bit '1' is added.
 If there is an even number of 1's, the collection of bits is said to have even parity; a parity bit '0' is added.

Checksum Calculation (contd..)
 The number of 1's among a collection of bits and their parity bit is always even.
 During a write operation, the disk controller calculates the parity bit and appends it to the sequence of bits written in the sector.
 Every sector will thus have even parity.
Examples...
 The sequence of bits 01101000 has an odd number of 1's. The parity bit will be 1, so the sequence with the parity bit is 011010001.
 The sequence of bits 11101110 has even parity, as it has an even number of 1's. With the parity bit 0, the sequence is 111011100.
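
A minimal sketch of the parity computation described above (the function names are ours, not the book's):

def parity_bit(bits):
    """Return the parity bit that makes the total number of 1's even."""
    return bits.count('1') % 2

def with_checksum(bits):
    """Append the parity bit, as the controller does on a write."""
    return bits + str(parity_bit(bits))

def looks_good(bits_with_parity):
    """A sector reads as good if its 1's (data + parity) are even in number."""
    return bits_with_parity.count('1') % 2 == 0

print(with_checksum('01101000'))   # 011010001
print(with_checksum('11101110'))   # 111011100
print(looks_good('011010001'))     # True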
Checksum Calculation (contd..)
 Any one-bit error in reading or writing the bits results in a sequence of bits with odd parity.
 The disk controller counts the number of 1's; odd parity signals the presence of an error.
Odds
 There is a chance that more than one bit is corrupted and the error goes unnoticed.
 Increasing the number of parity bits increases the chance of detecting errors.
 In general, with n independent bits as a checksum, the chance of missing an error is one in 2^n.
Stable Storage.
 Checksums can detect the error but cannot correct it.
 Sometimes we overwrite the previous contents of a
sector and yet cannot read the new contents correctly.
 To deal with these problems, Stable Storage policy
can be implemented on the disks.
Stable Storage (contd..)
 Sectors are paired, and each pair represents one sector-contents X.
 The left copy of the sector is denoted XL and the right copy XR.
Assumptions
 We assume that copies are written with a sufficient number of parity bits that the chance of a bad sector looking good when the parity checks are considered is negligible.
 Also, if the read function returns a good value w for either XL or XR, then w is assumed to be the true value of X.
Stable-Storage Writing Policy:
1. Write the value of X into XL. Check that the value has status "good", i.e., the parity-check bits are correct in the written copy. If not, repeat the write. If after a set number of write attempts we have not successfully written X into XL, assume that there is a media failure in this sector. A fix-up, such as substituting a spare sector for XL, must be adopted.
2. Repeat (1) for XR.
Stable-Storage Reading Policy:
 The policy is to alternate trying to read XL and XR until a good value is returned.
 If a good value is not returned after a pre-chosen number of tries, then X is assumed to be truly unreadable.
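
A sketch of both policies, assuming a hypothetical sector API in which write(copy, value) stores a copy and read(copy) returns the value, or None when the parity check fails (MAX_TRIES is an arbitrary retry budget of our choosing):

MAX_TRIES = 5   # arbitrary retry budget (assumption)

def stable_write(pair, value, write, read):
    """Write value to the left then the right copy, retrying each copy
    until it reads back as good; signal media failure if retries run out."""
    for copy in ("L", "R"):
        for _ in range(MAX_TRIES):
            write((pair, copy), value)
            if read((pair, copy)) is not None:    # parity check passed
                break
        else:
            raise IOError("media failure on copy " + copy +
                          "; substitute a spare sector")

def stable_read(pair, read):
    """Alternate reading XL and XR until a good value is returned."""
    for attempt in range(2 * MAX_TRIES):
        copy = "L" if attempt % 2 == 0 else "R"
        value = read((pair, copy))
        if value is not None:
            return value
    raise IOError("X is truly unreadable")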
Error-Handling capabilities:
Media failures:
 • If, after storing X in sectors XL and XR, one of them undergoes a media failure and becomes permanently unreadable, we can read from the other.
 • Only if both sectors have failed is X unreadable.
 • The probability of both failing is extremely small.
Error-Handling
Capabilities(contd..)
Write Failure:
 • While writing X, if there is a system failure (such as a power outage), the copy of X in main memory is lost and the copy of X being written may be erroneous.
 • Half of the sector may be written with part of the new value of X, while the other half remains as it was.
Error-Handling
Capabilities(contd..)
The possible cases when the system becomes available again:
1. The failure occurred while writing XL. Then XL is considered bad. Since XR was never changed, its status is good. We copy XR, which holds the old value of X, into XL.
2. The failure occurred after XL was written. Then XL has good status, and XR, which still holds the old value of X, may have bad status. We copy the new value of X from XL to XR.
Recovery from Disk Crashes
 To reduce data loss from disk crashes, schemes involving redundancy, extending the idea of parity checks or duplicated sectors, can be applied.
 The term used for these strategies is RAID, or Redundant Arrays of Independent Disks.
 In general, if the mean time to failure of disks is n years, then in any given year 1/nth of the surviving disks fail.
Recovery from Disk
Crashes(contd..)
Each of the RAID schemes has data disks and redundant disks.
 Data disks are one or more disks that hold the data.
 Redundant disks are one or more disks that hold information completely determined by the contents of the data disks.
 When one of the disks crashes, the other disks can be used to restore it, avoiding permanent information loss.
Disk Failures
Xiaqing He
ID: 204
Dr. Lin
Content
1) Focus on: "How to recover from disk crashes", under the common term RAID, "Redundant Arrays of Independent Disks".
2) Several schemes to recover from disk crashes:
 Mirroring: RAID level 1;
 Parity checks: RAID 4;
 Improvement: RAID 5;
 RAID 6;
1) Mirroring
 The mirroring scheme is referred to as RAID level 1 protection against data loss.
 In this scheme we mirror each disk.
 One of the disks is called the data disk and the other the redundant disk.
 In this case the only way data can be lost is if there is a second disk crash while the first crash is being repaired.
Benefits:
 • saves the data in case one disk fails;
 • divides data over several disks, allowing access to several blocks at once.
1) Mirroring (con't)
 For mirroring, when can data be lost?
 -- only if the second (mirror/redundant) disk crashes while the first (data) disk crash is being repaired.
 Probability:
 Suppose:
 • one disk: mean time to failure = 10 years;
 • for one of the two disks: mean time to the first failure = 5 years;
 • the process of replacing the failed disk = 3 hours = 1/2920 of a year;
 So:
 • the probability that the mirror disk fails during the repair = 1/10 * 1/2,920 = 1/29,200;
 • the mean time to data loss with mirroring: one disk failure every 5 years times a 1/29,200 chance, i.e., once in 5 * 29,200 = 146,000 years.
2) Parity Blocks
 Why change?
 -- disadvantage of mirroring: it uses so many redundant disks.
 What's new?
 -- RAID level 4 uses only one redundant disk.
 How does this one redundant disk work?
 -- modulo-2 sum (XOR);
 -- the jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks.
Example
2)Parity Blocks(con’t) - Example
Data disks:
• Disk1: 11110000
• Disk2: 10101010
• Disk3: 00111000
Redundant disk:
• Disk4: 01100010
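
A quick sketch of the modulo-2 sum, reproducing the example above (bit strings as Python strings; the helper name is ours):

def mod2_sum(*blocks):
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    n = 0
    for b in blocks:
        n ^= int(b, 2)
    return format(n, "0" + str(len(blocks[0])) + "b")

# Data disks from the example:
print(mod2_sum("11110000", "10101010", "00111000"))   # -> 01100010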
2) RAID 4 (con't)
 Reading
 -- similar to reading blocks from any single disk;
 Writing
 1) change the data disk;
 2) change the corresponding block of the redundant disk;
 • Why?
 -- it holds the parity checks for the corresponding blocks of all the data disks.

2) RAID 4 (con't) _ writing
For a total of N data disks:
1) naive way:
 • read the corresponding blocks of all N data disks and compute their modulo-2 sum;
 • rewrite the redundant disk's block with that modulo-2 sum;
2) better way:
 • take the modulo-2 sum of the old and new versions of the data block that was rewritten;
 • flip the bits of the redundant disk's block at the positions where that modulo-2 sum has 1's;
2) RAID 4 (con't) _ writing _ Example
 Data disks:
 • Disk1: 11110000
 • Disk2: 10101010 -> 01100110
 • Disk3: 00111000
 To do:
 • Modulo-2 sum of the old and new versions of disk 2: 11001100
 • So we need to flip positions 1, 2, 5, 6 of the redundant disk.
 Redundant disk:
 • Disk4: 01100010 -> 10101110
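
The "better way" in code, reusing mod2_sum from the sketch above:

old_data, new_data = "10101010", "01100110"
old_parity = "01100010"

delta = mod2_sum(old_data, new_data)       # 11001100: the positions that changed
new_parity = mod2_sum(old_parity, delta)   # flip exactly those positions
print(new_parity)                          # -> 10101110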
2) RAID 4 (con't) _ failure recovery
 Redundant disk crash:
 -- swap in a new one and recompute its contents from all the data disks;
 One of the data disks crashes:
 -- swap in a new one;
 -- recompute its data from the other disks, including the remaining data disks and the redundant disk;
 How to recompute? (same rule; that is why there is room for the improvement below)
 -- take the modulo-2 sum of all the corresponding bits of all the other disks
3) An Improvement: RAID 5

RAID 4 is effective in preserving data unless there are two
simultaneous disk crashes.

Whatever scheme we use for updating the disks, we need to
read and write the redundant disk's block. If there are n data
disks, then the number of disk writes to the redundant disk will
be n times the average number of writes to any one data disk.

However we do not have to treat one disk as the redundant disk
and the others as data disks. Rather, we could treat each disk as
the redundant disk for some of the blocks. This improvement is
often called RAID level 5.
3) RAID 5 (con't)
 How do we recognize which blocks of each disk treat that disk as the redundant disk?
 -- if there are n+1 disks, labeled 0 to n, then we can treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1;
Example;
3) RAID 5 (con't) _ Example
 • n = 3, so there are 4 disks;
 • the first disk, labeled 0, is redundant for its cylinders 4, 8, 12, ...;
 • the second disk, labeled 1, is redundant for its cylinders 1, 5, 9, ...;
 • the third disk, labeled 2, is redundant for its cylinders 2, 6, 10, ...;
 • ...
 Suppose all 4 disks are equally likely to be written. For any one of the 4 disks, the probability of being involved in a given write:
 • 1/4 + 3/4 * 1/3 = 1/2
 • for m disks: 1/m + (m-1)/m * 1/(m-1) = 2/m
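
The placement rule as a one-line check (a sketch; disk labels and cylinder numbers as in the example above):

def is_redundant(disk_j, cylinder_i, num_disks=4):
    """In RAID 5, disk j holds the redundant block of cylinder i
    exactly when j is the remainder of i divided by the number of disks."""
    return disk_j == cylinder_i % num_disks

print([d for d in range(4) if is_redundant(d, 9)])   # -> [1]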
4) Coping with Multiple Disk Crashes
 Error-correcting codes theory, known as Hamming codes, leads to RAID level 6.
 -- it can deal with any number of disk crashes if enough redundant disks are used.
 Example
 A system of seven disks (four data disks, numbered 1-4, and 3 redundant disks, numbered 5-7);
 • How to set up the 3x7 matrix?
 (why 3? -- there are 3 redundant disks)
 1) every column is a distinct pattern of three 1's and 0's, except the all-0's pattern;
 2) the column of a redundant disk has a single 1;
 3) the column of a data disk has at least two 1's;
4) Coping with multiple disk crashes (con't)
Reading:
 • read from the data disks and ignore the redundant disks
Writing:
 • change the data disk
 • change the corresponding bits of all the redundant disks whose rows have a 1 in that data disk's column
4) Coping with multiple disk crashes (con't)
 In this system with 4 data disks and 3 redundant disks, how can we correct up to 2 disk crashes?
 • Suppose disks a and b have failed:
 • find some row r (in the 3x7 matrix) in which the columns for a and b differ (say a has 0 and b has 1);
 • compute the correct b by taking the modulo-2 sum of the corresponding bits from all the disks other than b that have 1's in row r;
 • after restoring b, compute the correct a using all the other disks, which are now available;
 Example
4) Coping with multiple disk crashes (con't) _ example

The 3x7 matrix (data disks 1-4, redundant disks 5-7):

disk number:  1  2  3  4  5  6  7
row 1:        1  1  1  0  1  0  0
row 2:        1  1  0  1  0  1  0
row 3:        1  0  1  1  0  0  1
4) Coping with multiple disk crashes (con't) _ example
First block of all the disks:

disk    contents
1)      11110000
2)      10101010
3)      00111000
4)      01000001
5)      01100010
6)      00011011
7)      10001001
4) Coping with multiple disk crashes (con't) _ example
Two disks crash:

disk    contents
1)      11110000
2)      ????????
3)      00111000
4)      01000001
5)      ????????
6)      00011011
7)      10001001
4) Coping with multiple disk crashes (con't) _ example
In the 3x7 matrix, row 2 gives disks 2 and 5 different values: disk 2's entry is 1 and disk 5's entry is 0.
So: compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4, 6 (the other disks with a 1 in row 2);
then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2, 3 (the other disks with a 1 in row 1).

disk    contents
1)      11110000
2)      ???????? => 10101010
3)      00111000
4)      01000001
5)      ???????? => 01100010
6)      00011011
7)      10001001
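
A sketch of the two-crash recovery, hard-coding the matrix and block contents above (the function and variable names are ours):

def mod2_sum(*blocks):
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    n = 0
    for b in blocks:
        n ^= int(b, 2)
    return format(n, "0" + str(len(blocks[0])) + "b")

# Columns of the 3x7 matrix, one per disk (disks 1..7).
MATRIX = {1: (1,1,1), 2: (1,1,0), 3: (1,0,1), 4: (0,1,1),
          5: (1,0,0), 6: (0,1,0), 7: (0,0,1)}

def recover(blocks, a, b):
    """Recover crashed disks a and b (their blocks are None)."""
    # Find a row where the columns for a and b differ; let b be the disk
    # whose column has the 1 in that row.
    r = next(i for i in range(3) if MATRIX[a][i] != MATRIX[b][i])
    if MATRIX[a][r] == 1:
        a, b = b, a
    peers = [d for d in MATRIX if d != b and MATRIX[d][r] == 1]
    blocks[b] = mod2_sum(*(blocks[d] for d in peers))
    # With b restored, recover a from any row where a's column has a 1.
    r2 = MATRIX[a].index(1)
    peers = [d for d in MATRIX if d != a and MATRIX[d][r2] == 1]
    blocks[a] = mod2_sum(*(blocks[d] for d in peers))
    return blocks

blocks = {1: "11110000", 2: None, 3: "00111000", 4: "01000001",
          5: None, 6: "00011011", 7: "10001001"}
print(recover(blocks, 2, 5))   # disk 2 -> 10101010, disk 5 -> 01100010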
13.5 Arranging data on disk
Meghna Jain
ID-205
CS257
Prof: Dr. T.Y.Lin
Data elements are represented as records, which are stored in consecutive bytes in the same disk block.
Basic layout techniques for storing data:
Fixed-Length Records
 Allocation criterion: data should start at a word boundary.
Fixed-length record header:
1. A pointer to the record schema.
2. The length of the record.
3. Timestamps indicating the time of last modification and/or last read.
Example
CREATE TABLE employee(
name CHAR(30) PRIMARY KEY,
address VARCHAR(255),
gender CHAR(1),
birthdate DATE
);
The record starts at a word boundary and contains a header followed by the four fields: name, address, gender, and birthdate.
Packing Fixed-Length Records into Blocks:
Records are stored in the form of blocks on the disk, and they move into main memory when we need to update or access them.
A block header is written first, and it is followed by a series of records.
The block header contains the following information:
 Links to one or more other blocks that are part of a network of blocks.
 Information about the role played by this block in such a network.
 Information about which relation the tuples of this block belong to.
 A "directory" giving the offset of each record in the block.
 Timestamp(s) indicating the time of the block's last modification and/or access.

Example
Along with the header, we pack as many records as will fit in one block, as shown in the figure; the remaining space is unused.
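
A minimal sketch of the packing arithmetic, assuming an illustrative 4096-byte block, a 24-byte block header, and 316-byte records (all sizes are made-up round numbers, not the book's):

BLOCK_SIZE = 4096    # assumed block size
HEADER_SIZE = 24     # assumed block-header size
RECORD_SIZE = 316    # assumed fixed record length (already word-aligned)

def record_offsets():
    """Offsets at which each record starts; leftover space stays unused."""
    offsets = []
    pos = HEADER_SIZE
    while pos + RECORD_SIZE <= BLOCK_SIZE:
        offsets.append(pos)
        pos += RECORD_SIZE
    return offsets

offs = record_offsets()
print(len(offs), "records per block; first offsets:", offs[:3])
# -> 12 records per block; first offsets: [24, 340, 656]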
13.6 REPRESENTING BLOCK
AND RECORD ADDRESSES
Ramya Karri
CS257 Section 2
ID: 206
INTRODUCTION
 Addresses of blocks and records:
 In main memory:
  The address of a block is the virtual-memory address of its first byte.
  The address of a record within the block is the virtual-memory address of the first byte of the record.
 In secondary memory: a sequence of bytes describes the location of the block in the overall system, e.g., the device ID for the disk, the cylinder number, etc.
ADDRESSES IN CLIENT-SERVER
SYSTEMS

The addresses in an address space are represented in two ways:
 Physical addresses: byte strings that determine the place within the secondary storage system where the record can be found.
 Logical addresses: arbitrary strings of bytes of some fixed length.
Physical address bits are used to indicate:
 The host to which the storage is attached
 The identifier for the disk
 The number of the cylinder
 The number of the track
 The offset of the beginning of the record
ADDRESSES IN CLIENT-SERVER
SYSTEMS (CONTD..)

A map table relates logical addresses to physical addresses; each entry pairs a logical address with the physical address it maps to.
[Figure: map table with columns "Logical" and "Physical", one row per (logical address, physical address) pair.]
LOGICAL AND STRUCTURED
ADDRESSES
Purpose of logical addresses?
 They give more flexibility when we:
  Move the record around within the block
  Move the record to another block
 They also give us an option for deciding what to do when a record is deleted.
[Figure: a block with a header and offset table at one end, records 1-4 at the other end, and unused space in between.]
POINTER SWIZZLING
Having pointers is common in object-relational database systems.
 It is important to learn about the management of pointers.
 Every data item (block, record, etc.) has two addresses:
  Database address: the address in the server's database address space, which locates the item on the secondary storage system.
  Memory address: an address in virtual memory, typically 4 bytes; we refer to such an address as the memory address.
POINTER SWIZZLING (CONTD…)

Translation table: maps database addresses to memory addresses.
[Figure: translation table with columns "Dbaddr" and "Mem-addr", one row per (database address, memory address) pair.]
 All addressable items in the database have entries in the map table, while only those items currently in memory are mentioned in the translation table.
POINTER SWIZZLING (CONTD…)

A pointer consists of the following two fields:
 A bit indicating whether the pointer is currently a database address or a memory address.
 The database or memory address itself.
 Example 13.17:
[Figure: Block 1 in memory holding a swizzled pointer into itself and an unswizzled pointer to Block 2, which is still on disk.]
EXAMPLE 13.17
Block 1 has a record with pointers to a second
record on the same block and to a record on
another block
 If Block 1 is copied to the memory



The first pointer which points within Block 1 can be
swizzled so it points directly to the memory address of
the target record
Since Block 2 is not in memory, we cannot swizzle the
second pointer
POINTER SWIZZLING (CONTD…)

Three types of swizzling:
 Automatic swizzling:
  As soon as a block is brought into memory, swizzle all relevant pointers.
 Swizzling on demand:
  Only swizzle a pointer if and when it is actually followed.
 No swizzling:
  Pointers are never swizzled; items are always accessed using the database address.
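
A sketch of swizzling on demand with a translation table; the Pointer class and dictionary-based table are our illustration, not the book's data structures:

class Pointer:
    """A pointer holding either a database address or a memory address."""
    def __init__(self, db_addr):
        self.swizzled = False    # the mode bit
        self.addr = db_addr

translation = {}                 # database address -> in-memory object

def follow(ptr, load_block):
    """Swizzle-on-demand: translate (loading if needed) on first use."""
    if not ptr.swizzled:
        if ptr.addr not in translation:
            translation[ptr.addr] = load_block(ptr.addr)   # fetch from disk
        ptr.addr = translation[ptr.addr]
        ptr.swizzled = True
    return ptr.addr

# Usage with a stub loader (the address string is made up):
p = Pointer("db:block2:rec1")
obj = follow(p, load_block=lambda a: {"loaded_from": a})
print(obj, p.swizzled)   # {'loaded_from': 'db:block2:rec1'} True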
PROGRAMMER CONTROL OF
SWIZZLING

Unswizzling
 When a block is moved from memory back to disk, all pointers must be converted back to database (disk) addresses.
 The translation table is used again.
 It is important to have an efficient data structure for the translation table.
PINNED RECORDS AND BLOCKS
 A block in memory is said to be pinned if it cannot be written back to disk safely.
 If block B1 has a swizzled pointer to an item in block B2, then B2 is pinned.
 To unpin a block, we must unswizzle any pointers to it:
  Keep in the translation table the places in memory holding swizzled pointers to that item.
  Unswizzle those pointers (use the translation table to replace the memory addresses with database (disk) addresses).
Eswara Satya Pavan Rajesh Pinapala
CS 257
ID: 221
Topics
 Records with Variable Length Fields
 Records with Repeating Fields
 Variable Format Records
 Records that do not fit in a block
 BLOBS
Example
[Fig 1: Movie-star record with four fields: name at offset 0, address at offset 30, gender at offset 286, birth date at offset 287; record length 297 bytes.]
Records with Variable-Length Fields
If one or more fields of a record have variable length, then the record must contain enough information to locate every field. An effective way to represent variable-length records is as follows:
 Fixed-length fields are kept ahead of the variable-length fields.
 The record header contains:
 • the length of the record
 • pointers to the beginnings of all the variable-length fields except the first one.
Records with Variable-Length Fields
[Figure 2: A movie-star record with name and address implemented as variable-length character strings; the header holds the record length and a pointer to the address field, followed by gender, birth date, name, and address.]
Records with Repeating Fields
A record may contain a variable number of occurrences of a field F:
 All occurrences of field F are grouped together, and the record header contains a pointer to the first occurrence of field F.
 L bytes are devoted to each instance of field F.
 To locate an occurrence of field F within the record, add to the offset of the first occurrence an integer multiple of L (0, L, 2L, 3L, and so on), stopping at the desired occurrence.
Records with Repeating Fields
[Figure 3: A record with a repeating group of references to movies; the header holds the record length, a pointer to the address field, and a pointer to the movie pointers, followed by name, address, and the pointers to movies.]
Records with Repeating Fields
[Figure 4: Storing variable-length fields separately from the record; the fixed-length record holds, for each variable-length component, a pointer and a length (to name, to address, and to the movie references along with the number of references).]
Records with Repeating Fields
Advantage
 Keeping the record itself fixed-length allows records to be searched more efficiently, minimizes the overhead in block headers, and allows records to be moved within or among blocks with minimum effort.
Disadvantage
 Storing variable-length components on another block increases the number of disk I/O's needed to examine all components of a record.
Records with Repeating Fields
A compromise strategy is to allocate a fixed portion of the record for the repeating fields:
 If the number of repeating fields is less than the allocated space, some space is unused.
 If the number of repeating fields is greater than the allocated space, the extra fields are stored in a different location, and a pointer to that location plus a count of the additional occurrences is stored in the record.
Variable-Format Records
 Records that do not have a fixed schema.
 Variable-format records are represented by a sequence of tagged fields.
 Each tagged field carries:
 • the attribute or field name
 • the type of the field
 • the length of the field
 • the value of the field
 Why use tagged fields?
 • information-integration applications
 • records with a very flexible schema
Variable-Format Records
[Fig 5: A record with tagged fields: (N, S, 14, "Clint Eastwood") for the name, where N is the code for name, S the code for string type, and 14 the length; and (R, S, 16, "Hog's Breath Inn") for the restaurant owned.]
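
A sketch of tagged-field encoding and decoding; the single-character codes and length-prefixed layout mirror Fig 5, but the byte format itself is our illustration:

def encode_tagged(fields):
    """Encode (code, type, value) triples as tagged fields:
    1-byte field code, 1-byte type code, 2-byte length, then the value."""
    out = bytearray()
    for code, ftype, value in fields:
        data = value.encode()
        out += code.encode() + ftype.encode()
        out += len(data).to_bytes(2, "big") + data
    return bytes(out)

def decode_tagged(buf):
    fields, i = [], 0
    while i < len(buf):
        code, ftype = chr(buf[i]), chr(buf[i + 1])
        length = int.from_bytes(buf[i + 2:i + 4], "big")
        value = buf[i + 4:i + 4 + length].decode()
        fields.append((code, ftype, value))
        i += 4 + length
    return fields

rec = encode_tagged([("N", "S", "Clint Eastwood"),
                     ("R", "S", "Hog's Breath Inn")])
print(decode_tagged(rec))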
Records That Do Not Fit in a Block
 When the length of a record is greater than the block size, the record is divided and placed into two or more blocks.
 The portion of the record in each block is referred to as a RECORD FRAGMENT.
 A record with two or more fragments is called a SPANNED RECORD.
 A record that does not cross a block boundary is called an UNSPANNED RECORD.
Spanned Records
Spanned records require the following extra header information:
 • a bit indicating whether the piece is a fragment
 • a bit indicating whether it is the first or last fragment of a record
 • pointers to the next and/or previous fragment of the same record
Records That Do Not Fit in a Block
[Figure 6: Storing spanned records across blocks; block 1 holds record 1 and fragment 2-a of record 2, block 2 holds fragment 2-b and record 3, with block and record headers marking the fragments.]
BLOBS
 Large binary objects are called BLOBs, e.g., audio files, video files.
 Storage of BLOBs
 Retrieval of BLOBs
Record Modifications
Chapter 13
Section 13.8
Neha Samant
CS 257
(Section II) Id 222
Modification Types
 Insertion
 Deletion
 Update
Introduction
 • What is a record?
 A record is a single, implicitly structured data item in a database table. A record is also called a tuple.
 • What is record modification?
 We say records are modified when a data manipulation operation is performed on them.
Insertion
 Insertion of records without order:
  Records can be placed in a block with empty space or in a new block.
 Insertion of records in a fixed order:
  Space available in the block
  No space available in the block (outside the block)
 Records may be referenced by a structured address or by a pointer to the record from outside the block.
Insertion in Fixed Order
Space available within the block
 Use an offset table in the header of each block, with pointers to the location of each record in the block.
 The records are slid within the block and the pointers in the offset table are adjusted.
[Figure: a block with a header and offset table at one end, records 1-4 at the other end, and unused space in between.]
Insertion in Fixed Order
No space available within the block (outside the block)
 Find space on a "nearby" block.
 • If no space is available on the block, look at the following block in the sorted order of blocks.
 • If space is available in that block, move the highest records of block 1 to block 2 and slide the records around on both blocks. If there are external pointers, we leave a forwarding address in the offset table of block 1 saying that a certain record has been moved to block 2 and where its entry is in the offset table of block 2.
 Create an overflow block
 • Records can be stored in the overflow block.
 • Each block has a place in its header for a pointer to an overflow block.
 • The overflow block can point to a second overflow block, and so on.
[Figure: block B with a pointer to its overflow block.]
Deletion
 Recovering space after deletion:
  When using an offset table, the records can be slid around in the block so there is one unused region in the center that can be recovered.
  If we cannot slide records, an available-space list can be maintained in the block header.
  The list head goes in the block header, and the available regions hold the links of the list.
Deletion
 Use of tombstones:
  A tombstone is placed where a record was, so that pointers to the deleted record do not end up pointing at a new record.
  The tombstone is permanent until the entire database is reconstructed.
  If pointers go to fixed locations from which the location of the record is found, then we put the tombstone in that fixed location (see examples).
  Where a tombstone is placed depends on the nature of the record pointers.
  A map table is used to translate a logical record address to a physical address.
Deletion
 Use of tombstones:
  If we need to replace records by tombstones, we place the bit that serves as the tombstone at the beginning of the record.
  This bit remains at the record's location, and the subsequent bytes can be reused for another record.
[Figure: Record 1 can be replaced, but its tombstone bit remains; Record 2 has no tombstone and can be seen when we follow a pointer to it.]
Update
 Fixed-length update:
  No effect on the storage system, as the record occupies the same space as before the update.
 Variable-length update:
  Longer length
  Shorter length
Update
Variable-length update (longer length):
 Stored on the same block:
  Slide records.
  Create an overflow block.
 Stored on another block:
  Move records around that block, or
  Create a new block for storing variable-length fields.
Update
Variable-length update (shorter length):
 Same as deletion:
  Recover space.
  Consolidate space.
BTrees & Bitmap Indexes
14.2
DATABASE SYSTEMS – The Complete
Book
Presented By:
Under the supervision of:
Maciej Kicinski
Dr. T.Y.Lin

B-Trees ►►
Structure
 A balanced tree, meaning that all paths from the root to a leaf have the same length.
 There is a parameter n associated with each B-tree block. Each block has space for n search keys and n+1 pointers.
 The root may have as few as 1 key, but all other blocks must be at least half full.
Structure
 A typical node holds N search keys and N+1 pointers.
 A typical interior node's pointers point to nodes at the next level down; interior nodes hold no record values.
 A typical leaf's pointers point to records.
Application
 The search key of the B-tree is the primary key for the data file, and there is one key-pointer pair in a leaf for every record of the data file.
 Or the data file is sorted by its primary key, and the B-tree is an index with one key-pointer pair at a leaf for each block of the data file.
 Or the data file is sorted by an attribute that is not a key, and this attribute is the search key for the B-tree. For each key value K that appears in the data file there is one key-pointer pair at a leaf; that pointer goes to the first of the records that have K as their sort-key value.
Lookup
 If at an interior node, choose the correct pointer to follow. This is done by comparing the search value to the keys.
 If at a leaf node, choose the key that matches what you are looking for; its pointer leads to the data.
 Suppose we want to find a record with search key 40.
 We start at the root; the root key is 13, and since 40 >= 13 we follow the pointer to the right.
 Then we keep searching by the same principle.
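
A sketch of the lookup walk over a simple in-memory node layout (this Node class is our illustration; leaves keep parallel keys/records lists):

import bisect

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys             # sorted search keys
        self.children = children     # child nodes (interior) or None (leaf)
        self.records = records       # data pointers (leaf only)

def lookup(node, key):
    """Walk from the root to a leaf, then probe the leaf for the key."""
    while node.children is not None:              # interior node
        i = bisect.bisect_right(node.keys, key)   # first key > search value
        node = node.children[i]
    if key in node.keys:                          # leaf node
        return node.records[node.keys.index(key)]
    return None

leaf1 = Node([7, 13], records=["rec7", "rec13"])
leaf2 = Node([23, 40], records=["rec23", "rec40"])
root = Node([23], children=[leaf1, leaf2])
print(lookup(root, 40))   # -> rec40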
Insertion
 When inserting, find the correct leaf node for the new key-pointer pair.
 If the node is full, create a new node and split the keys between the two.
 Recursively move up: if a parent cannot take a pointer to the new node because it is full, split it as well.
 This can end with creating a new root node, if the current root was full.
Deletion
 Perform a lookup to find the node containing the key, and delete it.
 If the node is no longer at least half full, either join it with an adjacent sibling and recursively delete the separating key above, or move a key over from a sibling that has keys to spare and adjust the pointer above.
Efficiency
 B-trees allow lookup, insertion, and deletion of records using very few disk I/O's.
 Each level of a B-tree requires one read; following the final pointer to the data gives the last read.
 Three levels usually suffice: with 255 pointers per block, 255^3 is about 16.6 million blocks.
 Disk I/O's can be reduced further by keeping a level of the B-tree in main memory. Keeping the root block (255 pointers) reduces the reads to 2, and it is even possible to keep the next 255 blocks in memory to reduce the reads to 1.
BTrees & Bitmap Indexes
14.7
DATABASE SYSTEMS – The Complete
Book
Presented By:
Under the supervision of:
Deepti Kundu
Dr. T.Y.Lin
Bitmap Indexes ►►
Definition
A bitmap index for a field F is a collection of
bit-vectors of length n, one for each possible
value that may appear in that field F.[1]
What does that mean?
 • Assume relation R with:
 – 2 attributes, A and B.
 – Attribute A of type Integer and B of type String.
 – 6 records, numbered 1 through 6, as shown:

 record   A    B
 1        30   foo
 2        30   bar
 3        40   baz
 4        50   foo
 5        40   bar
 6        30   baz
Example Continued...
 • A bitmap index for attribute B is:

 Value   Vector
 foo     100100
 bar     010010
 baz     001001
Where do we reach?
 A bitmap index is a special kind of database index that uses bitmaps.[2]
 Bitmap indexes are considered to work well for data such as gender, which has a small number of distinct values (e.g., male and female) but many occurrences of those values.[2]
A little more...
 A bitmap index for attribute A of relation R is:
  A collection of bit-vectors.
  The number of bit-vectors = the number of distinct values of A in R.
  The length of each bit-vector = the cardinality of R.
  The bit-vector for value v has 1 in position i if the ith record has v in attribute A, and 0 there if not.[3]
 Records are assigned permanent numbers.[3]
 There is a mapping between record numbers and record addresses.[3]
Motivation for Bitmap Indexes
 Very efficient when used for partial-match queries,[3] where we find tuples with several specified attributes without first retrieving all the records that matched on each attribute individually.
 They offer the advantage of buckets.[2]
 They can also help answer range queries.[3]
Another Example
Multidimensional array of multiple types: {(5,d), (79,t), (4,d), (79,d), (5,t), (6,a)}
 5  = 100010
 79 = 010100
 4  = 001000
 6  = 000001
 d  = 101100
 t  = 010010
 a  = 000001
Example Continued...
{(5,d), (79,t), (4,d), (79,d), (5,t), (6,a)}
Searching for items is easy: just AND the vectors together.
To search for (5,d):
 5 = 100010
 d = 101100
 100010 AND 101100 = 100000
The location of the record has been traced: only record 1 matches!
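
The same partial-match search in code, treating bit-vectors as Python integers (an illustrative encoding choice of ours):

# Bit-vectors from the example; the most significant bit is record 1.
vectors = {5: 0b100010, 79: 0b010100, 4: 0b001000, 6: 0b000001,
           'd': 0b101100, 't': 0b010010, 'a': 0b000001}

def match(*values, n_records=6):
    """AND the bit-vectors of the requested values; return matching records."""
    bits = vectors[values[0]]
    for v in values[1:]:
        bits &= vectors[v]
    # Record i corresponds to bit position n_records - i (counting from 1).
    return [i for i in range(1, n_records + 1) if bits >> (n_records - i) & 1]

print(match(5, 'd'))   # -> [1]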
Compressed Bitmaps
 Assume:
  The number of records in R is n.
  Attribute A has m distinct values in R.
 The size of a bitmap index on attribute A is then m*n bits.
 If m is large, the fraction of 1's in each vector will be around 1/m, so there is an opportunity to encode the vectors compactly.
 A common encoding approach is called run-length encoding.[1]
Run-Length Encoding
 Represents runs, where a run is a sequence of i 0's followed by a 1, by a suitable binary encoding of the integer i:
  First compute how many bits are needed to represent i; say k.
  Then represent the run by k-1 1's and a single 0, followed by the k bits that represent i in binary.
 The encoding for i = 1 is 01 (k = 1).
 The encoding for i = 0 is 00 (k = 1).
 We concatenate the codes for each run together, and the resulting sequence of bits is the encoding of the entire bit-vector.
Understanding with an Example
 Let us decode the sequence 11101101001011.
 Starting at the beginning (leftmost bit):
  First run: the first 0 is at position 4, so k = 4. The next 4 bits are 1101, so the first integer is i = 13.
  Second run: the remaining input is 001011. It starts with a 0, so k = 1; the next bit is 0, so i = 0.
  Last run: 1011. There is one 1 before the 0, so k = 2; the next 2 bits are 11, so i = 3.
 The run lengths are thus 13, 0, 3, hence our bit-vector is:
 0000000000000110001
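
A decoder for this scheme as a sketch (the encoding is exactly the one defined above; the function name is ours):

def decode_runs(code):
    """Decode a run-length-encoded bit-vector back into runs and bits."""
    runs, pos = [], 0
    while pos < len(code):
        k = 1
        while code[pos] == '1':        # k-1 leading 1's
            k += 1
            pos += 1
        pos += 1                       # skip the terminating 0
        i = int(code[pos:pos + k], 2)  # k bits give the run length i
        pos += k
        runs.append(i)
    bits = ''.join('0' * i + '1' for i in runs)
    return runs, bits

print(decode_runs('11101101001011'))
# -> ([13, 0, 3], '0000000000000110001')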
Managing Bitmap Indexes
1) How do we find a specific bit-vector for a value efficiently?
2) After selecting results that match, how do we retrieve the records efficiently?
3) When data is changed, how do we alter the bitmap index?
1) Finding Bit-Vectors
 Think of each bit-vector as the value associated with a key.[1]
 Any secondary storage technique will be efficient in retrieving the values.[1]
 Create a secondary index with the attribute value as the search key:[3]
  B-tree
  Hash
2) Finding Records
 Create a secondary index with the record number as the search key.[3]
 In other words, for finding record k, use a secondary index whose search key is the record number (the kth position).[1]
3) Handling Modifications
Two things to remember:
 Record numbers cannot change once assigned.
 Changes to the data file require changes to the bitmap index.
Deletion
 A tombstone replaces the deleted record.
 The corresponding bit is set to 0.
Insertion
 The record is assigned the next record number.
 A bit of value 0 or 1 is appended to each bit-vector.
 If the new record contains a new value of the attribute, add one bit-vector.
Modification
 Change the bit corresponding to the old value of the modified record to 0.
 Change the bit corresponding to the new value of the modified record to 1.
 If the new value is a new value of A, insert a new bit-vector.
Concurrency Control
18.1 – 18.2
Chiu Luk
CS257 Database Systems Principles
Spring 2009
Concurrency Control

Concurrency control in database management systems (DBMS)
ensures that database transactions are performed concurrently without
the concurrency violating the data integrity of a database.

Executed transactions should follow the ACID rules. The DBMS must
guarantee that only serializable (unless Serializability is intentionally
relaxed), recoverable schedules are generated.

It also guarantees that no effect of committed transactions is lost, and
no effect of aborted (rolled back) transactions remains in the related
database.
Transaction ACID rules
Atomicity - Either the effects of all or none of its operations
remain when a transaction is completed - in other words, to
the outside world the transaction appears to be indivisible,
atomic.
Consistency - Every transaction must leave the database in
a consistent state.
Isolation - Transactions cannot interfere with each other.
Providing isolation is the main goal of concurrency control.
Durability - Successful transactions must persist through
crashes.
Serial and Serializable Schedules
In the field of databases, a schedule is a list of actions (i.e., reading, writing, aborting, committing) from a set of transactions.
 In this example, Schedule D is the set of 3 transactions T1, T2, T3. The schedule describes the actions of the transactions as seen by the DBMS. T1 reads and writes object X, then T2 reads and writes object Y, and finally T3 reads and writes object Z. This is an example of a serial schedule, because the actions of the 3 transactions are not interleaved.
Serial and Serializable Schedules

A schedule that is equivalent to a serial schedule has the serializability property.

In schedule E, the order in which the actions of the transactions are executed is not the same as in D,
but in the end, E gives the same result as D.
Serial Schedule: T1 precedes T2

T1                       T2                       A      B
                                                  25     25
Read(A); A := A+100
Write(A);                                         125
Read(B); B := B+100;
Write(B);                                                125
                         Read(A); A := A*2;
                         Write(A);                250
                         Read(B); B := B*2;
                         Write(B);                       250
Serial Schedule: T2 precedes T1

T1                       T2                       A      B
                                                  25     25
                         Read(A); A := A*2;
                         Write(A);                50
                         Read(B); B := B*2;
                         Write(B);                       50
Read(A); A := A+100
Write(A);                                         150
Read(B); B := B+100;
Write(B);                                                150
A serializable, but not serial, schedule

T1                       T2                       A      B
                                                  25     25
Read(A); A := A+100
Write(A);                                         125
                         Read(A); A := A*2;
                         Write(A);                250
Read(B); B := B+100;
Write(B);                                                125
                         Read(B); B := B*2;
                         Write(B);                       250

r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);
A nonserializable schedule

T1                       T2                       A      B
                                                  25     25
Read(A); A := A+100
Write(A);                                         125
                         Read(A); A := A*2;
                         Write(A);                250
                         Read(B); B := B*2;
                         Write(B);                       50
Read(B); B := B+100;
Write(B);                                                150
A schedule that is serializable only because of the detailed behavior of the transactions

T1                       T2'                      A      B
                                                  25     25
Read(A); A := A+100
Write(A);                                         125
                         Read(A); A := A*1;
                         Write(A);                125
                         Read(B); B := B*1;
                         Write(B);                       25
Read(B); B := B+100;
Write(B);                                                125

Here T2' multiplies by 1, so the final state (A = 125, B = 125) matches a serial order regardless of the interleaving.
Non-Conflicting Actions
Two actions are non-conflicting if whenever they
occur consecutively in a schedule, swapping them
does not affect the final state produced by the
schedule. Otherwise, they are conflicting.
Conflicting Actions: General Rules
 Two actions of the same transaction conflict:
  r1(A), w1(B)
 Two actions over the same database element conflict if one of them is a write:
  r1(A), w2(A)
  w1(A), w2(A)
Conflicting Actions
 Two or more actions are said to be in conflict if:
  The actions belong to different transactions.
  At least one of the actions is a write operation.
  The actions access the same object (read or write).
 The following set of actions is conflicting:
  T1:R(X), T2:W(X), T3:W(X)
 While the following sets of actions are not:
  T1:R(X), T2:R(X), T3:R(X)
  T1:R(X), T2:W(Y), T3:R(X)
Conflict Serializability
 We may take any schedule and make as many nonconflicting swaps as we wish, with the goal of turning the schedule into a serial schedule.
 If we can do so, then the original schedule is serializable, because its effect on the database state remains the same as we perform each of the nonconflicting swaps.
Conflict Serializable

A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more
serial schedules.

Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if
there exists an acyclic precedence graph/serializability graph for the schedule.

[Example omitted: a schedule that is conflict-equivalent to the serial schedule <T1,T2>, but not to <T2,T1>.]
Conflict Equivalent / Conflict-Serializable
 Let Ai and Aj be consecutive non-conflicting actions that belong to different transactions. We can swap Ai and Aj without changing the result.
 Two schedules are conflict-equivalent if they can be turned one into the other by a sequence of non-conflicting swaps of adjacent actions.
 We call a schedule conflict-serializable if it is conflict-equivalent to a serial schedule.
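
A sketch of the standard test via the precedence (serializability) graph mentioned above: build an edge Ti -> Tj for each conflict where Ti acts first, and check for cycles. The tuple-based schedule format is our illustration:

from collections import defaultdict

def conflict_serializable(schedule):
    """schedule: list of (txn, action, element), e.g. (1, 'R', 'A').
    Returns True iff the precedence graph is acyclic."""
    edges = defaultdict(set)
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and 'W' in (ai, aj):
                edges[ti].add(tj)          # ti's action precedes tj's
    # Detect a cycle with depth-first search.
    state = {}                             # 1 = in progress, 2 = done
    def has_cycle(u):
        state[u] = 1
        for v in edges[u]:
            if state.get(v) == 1 or (v not in state and has_cycle(v)):
                return True
        state[u] = 2
        return False
    return not any(has_cycle(t) for t in list(edges) if t not in state)

s = [(1,'R','A'),(1,'W','A'),(2,'R','A'),(2,'W','A'),
     (1,'R','B'),(1,'W','B'),(2,'R','B'),(2,'W','B')]
print(conflict_serializable(s))   # True: equivalent to serial <T1,T2>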
Conflict-Serializable: converting a schedule to a serial one by nonconflicting swaps

Start:                    r1(A); w1(A); r2(A); r1(B); w2(A); w1(B); r2(B); w2(B)
Swap r2(A) and r1(B):     r1(A); w1(A); r1(B); r2(A); w2(A); w1(B); r2(B); w2(B)
Swap w2(A) and w1(B):     r1(A); w1(A); r1(B); r2(A); w1(B); w2(A); r2(B); w2(B)
Swap r2(A) and w1(B):     r1(A); w1(A); r1(B); w1(B); r2(A); w2(A); r2(B); w2(B)

Serial schedule: all of T1's actions now precede all of T2's.
Concurrency Control
By Donavon Norwood
INTRODUCTION
 Enforcing serializability by locks
  Locks
  Locking scheduler
  Two-phase locking
 Locking systems with several lock modes
  Shared and exclusive locks
  Compatibility matrices
  Upgrading/updating locks
  Incrementing locks
Locks
It works as follows:
 A transaction requests a lock.
 The scheduler checks the lock table.
 The scheduler generates a serializable schedule of actions.
Consistency of Transactions
 Actions and locks must relate to each other:
  A transaction can only read or write an element if it holds a lock on it and has not yet released the lock.
  Unlocking every locked element is compulsory.
Legality of Schedules
 No two transactions may hold a lock on the same element at once.
 No transaction can acquire a lock on an element before the prior holder releases it.
Locking Scheduler
 Grants lock requests only if the result is a legal schedule.
 The lock table stores information about the current locks on elements.

The Locking Scheduler (contd.)
 A legal schedule of consistent transactions may, unfortunately, still not be serializable.

Locking Scheduler (contd.)
 The locking scheduler delays requests that would result in an illegal schedule.
Two-Phase Locking
 Guarantees that a legal schedule of consistent transactions is conflict-serializable.
 All lock requests precede all unlock requests.
 The growing phase:
  Obtain all the locks; no unlocks allowed.
 The shrinking phase:
  Release all the locks; no new locks allowed.
Working of Two-Phase Locking
 Assures serializability.
 Two protocols for 2PL:
  Strict two-phase locking: a transaction holds all its exclusive locks until commit/abort.
  Rigorous two-phase locking: a transaction holds all locks until commit/abort.
 For any transaction Ti that is not 2PL, it is possible to find a 2PL transaction Tj and a schedule S of Ti and Tj that is not conflict-serializable.

Failure of 2PL
 2PL fails to provide protection against deadlocks.
Concurrency Control:
18.4 Locking Systems with
Several Lock Modes
CS257 Spring/2009
Professor: Tsau Lin
Student: Suntorn Sae-Eung
ID: 212
18.4 Locking Systems with
Several Lock Modes

In 18.3, a transaction must lock a database element (X) whether it reads or writes it.
 There is no reason why several transactions could not read X at the same time, as long as none writes X.
 So we introduce two locking modes:
  Shared/Read lock (for reading)
  Exclusive/Write lock (for writing)
18.4.1 Shared & Exclusive Locks
 Transaction consistency:
  Cannot write without holding an exclusive lock.
  Cannot read without holding some lock.
 Consider a lock for writing as "stronger" than one for reading.
 This works on 2 principles:
  1. A read action can proceed only under a shared or an exclusive lock.
  2. A write action can proceed only under an exclusive lock.
 All locks must be released before commit.
18.4.1 Shared & Exclusive Locks (cont.)
 Two-phase locking (2PL) of transactions: Ti performs Lock, then R/W, then Unlock.
 Notation:
  sli(X): Ti requests a shared lock on DB element X
  xli(X): Ti requests an exclusive lock on DB element X
  ui(X): Ti relinquishes whatever lock it has on X
18.4.1 Shared & Exclusive Locks (cont.)
 Legality of schedules:
  An element may be locked either by one writing transaction or by several reading transactions in shared mode, but not both.
18.4.2 Compatibility Matrices
 A convenient way to describe lock-management policies:
  Rows correspond to a lock already held on an element by another transaction.
  Columns correspond to the mode of the lock requested.
 Example:

                   Lock requested
                   S      X
 Lock held    S    YES    NO
              X    NO     NO
18.4.3 Upgrading Locks
 A transaction T taking a shared lock is friendly toward other transactions.
 When T wants to both read and write a new value of X:
  1. T takes a shared lock on X.
  2. T performs its operations on X (which may take a long time).
  3. When T is ready to write the new value, T "upgrades" its shared lock to an exclusive lock on X.
18.4.3 Upgrading Locks (cont.)
 Observe the example: T cannot take an exclusive lock on B until all other locks on B are released; once B is released, T1 retries and succeeds.

18.4.3 Upgrading Locks (cont.)
 Upgrading can easily cause a deadlock:
  Both transactions want to upgrade on the same element.
  Both transactions will wait forever!
18.4.4 Update Locks
 A third lock mode resolves the deadlock problem. Its rules are:
  Only an update lock can be upgraded to a write (exclusive) lock later.
  An update lock may be granted on X when there are already shared locks on X.
  Once there is an update lock on X, it prevents any additional locks of any kind from being granted, and it is later changed to a write (exclusive) lock.
 Notation: uli(X)
18.4.4 Update locks (cont.)

Example
18.4.4 Update locks (cont.)
18.4.4 Update Locks (cont.)
 • Compatibility matrix (asymmetric):

                   Lock requested
                   S      X      U
 Lock held    S    YES    NO     YES
              X    NO     NO     NO
              U    NO     NO     NO
18.4.5 Increment Locks
 A useful lock for transactions that increase or decrease a value, e.g., a money transfer between two bank accounts.
 If 2 transactions (T1, T2) add constants to the same database element (X):
  It doesn't matter which goes first, but no reads are allowed in between the two increments.
  See the following exhibits.
18.4.5 Increment Locks (cont.)
CASE 1: A = 5; T1: INC(A,2) gives A = 7; then T2: INC(A,10) gives A = 17.
CASE 2: A = 5; T2: INC(A,10) gives A = 15; then T1: INC(A,2) gives A = 17.

18.4.5 Increment Locks (cont.)
 What if both transactions read A = 5 before either writes?
  T1: INC(A,2) writes A = 7; T2: INC(A,10) writes A = 15.
  The final value is 15 (or 7, depending on write order), but A != 17: one increment is lost.
18.4.5 Increment Locks (cont.)
 INC(A, c):
  An increment action writing database element A, which is an atomic execution consisting of:
  1. READ(A,t);
  2. t := t+c;
  3. WRITE(A,t);
 Notation:
  ili(X): action of Ti requesting an increment lock on X
  inci(X): action of Ti incrementing X by some constant; we don't care about the value of the constant.
18.4.5 Increment Locks (cont.)

Example
18.4.5 Increment Locks (cont.)
 • Compatibility matrix:

                   Lock requested
                   S      X      I
 Lock held    S    YES    NO     NO
              X    NO     NO     NO
              I    NO     NO     YES
Concurrency Control
Chapter 18
Section 18.5
Presented by
Khadke, Suvarna
CS 257
(Section II) Id 213
Overview
 Assumed knowledge:
  Locks
  Two-phase locking
  Lock modes: shared, exclusive, update
 A simple scheduler architecture based on the following principles:
  Insert lock actions into the stream of reads, writes, and other actions.
  Release locks when the transaction manager signals that the transaction will commit or abort.
Scheduler That Inserts Lock Actions into the Transactions' Request Stream

If a transaction is delayed, waiting for a lock, the scheduler performs the following actions:
 Part I: takes the stream of requests generated by the transactions and inserts appropriate lock actions ahead of the database operations (read, write, or update).
 Part II: takes each action (a lock or a database operation) from Part I and executes it. It determines the transaction T the action belongs to and the status of T (delayed or not). If T is not delayed, then:
 1. A database-access action is transmitted to the database and executed.
 2. If a lock action is received by Part II, it checks the lock table to see whether the lock can be granted:
  i) if granted, the lock table is modified to include the granted lock;
  ii) if not granted, the lock table is updated to record the requested lock, and Part II delays transaction T.
 3. When T commits or aborts, Part I is notified by the transaction manager and releases all of T's locks. If any transactions are waiting for those locks, Part I notifies Part II.
 4. Part II, when notified that a lock on some database element is available, determines the next transaction T' that can be granted the lock and continue.
The Lock Table
 A relation that associates database elements with locking information about each element.
 Implemented with a hash table, using database elements as the hash key.
 Its size is proportional to the number of locked elements only, not to the size of the entire database.
[Figure: database element A maps to the lock information for A.]
Lock Table Entry Structure
The sort of information found in a lock table entry:
 1) Group mode:
  S: only shared locks are held;
  X: one exclusive lock and no other locks;
  U: one update lock and possibly one or more shared locks.
 2) A waiting bit: whether some transaction is waiting for a lock on A.
 3) A list of the transactions that currently hold locks on A or are waiting for a lock on A.
Handling Lock Requests



Suppose transaction T requests a lock on A
If there is no lock table entry for A, then there are
no locks on A, so create the entry and grant the
lock request
If the lock table entry for A exists, use the group
mode to guide the decision about the lock request
Handling Lock Requests
If the group mode is U (update) or X (exclusive):
• No other lock can be granted
• Deny the lock request by T
• Place an entry on the list saying T requests a lock, with Wait? = 'yes'
If the group mode is S (shared):
• Another shared or update lock can be granted
• Grant the request for an S or U lock
• Create an entry for T on the list with Wait? = 'no'
• Change the group mode to U if the new lock is an update lock
Handling Unlock Requests
• Now suppose transaction T unlocks A
• Delete T's entry on the list for A
• If T's lock is not the same as the group mode, there is no need to change the group mode
• Otherwise, examine the entire list to determine the new group mode:
   S: new group mode is S, or no lock remains
   U: new group mode is S, or no lock remains
   X: no lock remains
Handling Unlock Requests
• If the value of Wait? is "yes", we need to grant one or more locks, using one of the following approaches:
First-Come-First-Served:
• Grant the lock to the longest-waiting request.
• No starvation (no transaction waits forever for a lock)
Priority to Shared Locks:
• Grant all waiting S locks, then one U lock.
• Grant an X lock if no others are waiting.
Priority to Upgrading:
• If there is a U lock waiting to upgrade to an X lock, grant that first.
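The following is a minimal, hypothetical Python sketch of the request-handling logic described above; the names (Entry, request) are mine, and only the S and X group modes are modeled for brevity:

# Sketch of lock-table entries and lock-request handling (S and X only).
class Entry:
    def __init__(self):
        self.group_mode = None   # current group mode for the element
        self.holders = []        # (transaction, mode) pairs holding locks
        self.waiting = []        # (transaction, mode) pairs waiting

table = {}

def request(trans, elem, mode):
    entry = table.setdefault(elem, Entry())
    if entry.group_mode is None:                  # no locks: grant
        entry.group_mode = mode
        entry.holders.append((trans, mode))
        return True
    if entry.group_mode == 'S' and mode == 'S':   # shared group: grant S
        entry.holders.append((trans, mode))
        return True
    entry.waiting.append((trans, mode))           # otherwise deny and wait
    return False

request('T1', 'A', 'S')    # granted
request('T2', 'A', 'S')    # granted; group mode stays S
request('T3', 'A', 'X')    # denied; T3 is placed on the waiting list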
Concurrency Control
Managing Hierarchies of Database Elements (18.6)
Presented by
Ronak Shah
(214)
March 9, 2009
Managing Hierarchies of
Database Elements
Two problems arise with locks when there is a tree structure to the data:
• The tree structure is a hierarchy of lockable elements, and we must determine how locks are granted for both large elements (relations) and smaller elements (blocks containing tuples, or individual tuples)
• The data itself is organized as a tree (e.g., B-tree indexes)
Locks with Multiple Granularity



A database element can be a relation, block or a
tuple
Different systems use different database elements
to determine the size of the lock
Thus some may require small database elements
such as tuples or blocks and others may require
large elements such as relations
Example of Multiple Granularity
Locks

Consider a database for a bank




Choosing relations as database elements means we
would have one lock for an entire relation
If we were dealing with a relation having account
balances, this kind of lock would be very inflexible and
thus provide very little concurrency
Why? Because balance transactions require exclusive
locks and this would mean only one transaction occurs
for one account at any time
But as each account is independent of others we could
perform transactions on different accounts
simultaneously
…(contd.)


Thus it makes sense to have block element for the lock
so that two accounts on different blocks can be updated
simultaneously
Another example is that of a document

With similar arguments as above, we see that it is better
to have large element (a complete document) as the lock
in this case
Warning (Intention) Locks

These are required to manage locks at different
granularities


In the bank example, if a shared lock is obtained for the
relation while there are exclusive locks on individual tuples,
unserializable behavior occurs
The rules for managing locks on hierarchy of
database elements constitute the warning protocol
Database Elements Organized in
Hierarchy
Rules of Warning Protocol


These involve both ordinary (S and X) and
warning (IS and IX) locks
The rules are:




Begin at the root of hierarchy
Request the S/X lock if we are at the desired element
If the desired element is further down the hierarchy, place a
warning lock (IS if S and IX if X)
When the warning lock is granted, we proceed to the child
node and repeat the above steps until desired node is
reached
Compatibility Matrix for Shared,
Exclusive and Intention Locks
        | IS    IX    S     X
  ------+---------------------
   IS   | Yes   Yes   Yes   No
   IX   | Yes   Yes   No    No
   S    | Yes   No    Yes   No
   X    | No    No    No    No

• The above matrix applies only to locks held by other transactions
Group Modes of Intention Locks
• A transaction can request S and IX locks on the same element at the same time (to read the entire element and then modify its subelements)
• This combination can be considered as another lock mode, SIX, having the restrictions of both locks, i.e., No for all except IS
• SIX serves as the group mode
Example

Consider a transaction T1 as follows



Select * from table where attribute1 = ‘abc’
Here, an IS lock is first acquired on the entire relation; then,
moving to the individual tuples (attribute1 = 'abc'), an S lock is
acquired on each of them
Consider another transaction T2


Update table set attribute2 = ‘def’ where attribute1 = ‘ghi’
Here, it requires an IX lock on relation and since T1’s IS
lock is compatible, IX is granted


On reaching the desired tuple (ghi), as there is no lock,
it gets X too
If T2 was updating the same tuple as T1, it would have
to wait until T1 released its S lock
Phantoms and Handling
Insertions Correctly



This arises when transactions create new sub
elements of lockable elements
Since we can lock only existing elements the new
elements fail to be locked
The problem created in this situation is explained
in the following example
Example

Consider a transaction T3



Select sum(length) from table where attribute1 = ‘abc’
This calculates the total length of all tuples having
attribute1
Thus, T3 acquires IS for relation and S for targeted
tuples
 Now, if another transaction T4 inserts a new tuple
having attribute1 = ‘abc’, the result of T3 becomes
incorrect
Example (…contd.)


This is not a concurrency problem since the serial
order (T3, T4) is maintained
But if both T3 and T4 were to write an element X,
it could lead to unserializable behavior




r3(t1);r3(t2);w4(t3);w4(X);w3(L);w3(X)
r3 and w3 are read and write operations by T3 and w4 are
the write operations by T4 and L is the total length
calculated by T3 (t1 + t2)
At the end, we have result of T3 as sum of lengths of t1 and
t2 and X has value written by T3
This is not right; if value of X is considered to be that
written by T3 then for the schedule to be serializable, the
sum of lengths of t1, t2 and t3 should be considered
Example (…contd.)



Else if the sum is total length of t1 and t2 then for the
schedule to be serializable, X should have value written by
T4
This problem arises since the relation has a
phantom tuple (the new inserted tuple), which
should have been locked but wasn’t since it didn’t
exist at the time locks were taken
The occurrence of phantoms can be avoided if all
insertion and deletion transactions are treated as
write operations on the whole relation
SECTION 18.7
THE TREE PROTOCOL
By :
Saloni Tamotia (215)

B-Trees
BASICS
- Tree data structure that keeps data sorted
- allow searches, insertion, and deletion
- commonly used in database and file systems

Lock
- Enforce limits on access to resources
- way of enforcing concurrency control

Lock Granularity
- The level and type of information that a lock protects
TREE PROTOCOL






• A kind of graph-based protocol
• An alternative to Two-Phase Locking (2PL)
• Database elements are disjoint pieces of data
• Nodes of the tree DO NOT form a hierarchy based on containment
• The way to get to a node is through its parent
Example: B-Tree
ADVANTAGES OF TREE
PROTOCOL


Unlocking takes less time as
compared to 2PL
Freedom from deadlocks
18.7.1 MOTIVATION FOR
TREE-BASED LOCKING
Consider B-Tree Index, treating individual
nodes as lockable database elements.
 Concurrent use of B-Tree is not possible
with standard set of locks and 2PL.
 Therefore, a protocol is needed which can
assure serializability by allowing access to
the elements all the way at the bottom of the
tree even if the 2PL is violated.

18.7.1 MOTIVATION FOR
TREE-BASED LOCKING (cont.)
Reason why "concurrent use of a B-tree is not possible with the standard set of locks and 2PL":
• every transaction must begin by locking the root node
• 2PL transactions cannot unlock the root until all the required locks are acquired
18.7.2 ACCESSING TREE
STRUCTURED DATA
Assumptions:
Only one kind of lock
 Consistent transactions
 Legal schedules
 No 2PL requirement on transaction

18.7.2 RULES FOR ACCESSING
TREE STRUCTURED DATA
RULES:
• The first lock may be at any node.
• Subsequent locks may be acquired only if the parent node is locked.
• Nodes may be unlocked at any time.
• No relocking of a node, even if the node's parent is still locked.
The sketch below illustrates these rules.
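A minimal sketch (mine; `parent` maps each node to its parent, while `held` and `released` map each transaction to sets of nodes) of checking the rules in Python:

# Sketch: may `trans` lock `node` under the tree-protocol rules?
def legal_lock(trans, node, parent, held, released):
    if not held[trans] and not released[trans]:
        return True                     # rule 1: the first lock may be anywhere
    if node in released[trans]:
        return False                    # rule 4: no relocking
    return parent.get(node) in held[trans]   # rule 2: parent must be locked

def unlock(trans, node, held, released):
    held[trans].discard(node)           # rule 3: unlock at any time
    released[trans].add(node)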
18.7.3 WHY TREE
PROTOCOL WORKS?
• The tree protocol implies a serial order on the transactions in the schedule.
• Order of precedence: Ti <s Tj
• If Ti locks the root before Tj, then Ti locks every node in common with Tj before Tj does.
ORDER OF PRECEDENCE
SECTION 18.8
Timestamps
What is Timestamping?
• The scheduler assigns each transaction T a unique number, its timestamp TS(T).
• Timestamps must be issued in ascending order, at the time when a transaction first notifies the scheduler that it is beginning.
Timestamp TS(T)
Two methods of generating timestamps:
• Use the value of the system clock as the timestamp.
• Use a logical counter that is incremented after a new timestamp has been assigned.
The scheduler maintains a table of currently active transactions and their timestamps, irrespective of the method used.
Timestamps for database element X
and commit bit
• RT(X): the read time of X, which is the highest timestamp of a transaction that has read X.
• WT(X): the write time of X, which is the highest timestamp of a transaction that has written X.
• C(X): the commit bit for X, which is true if and only if the most recent transaction to write X has already committed.
Physically Unrealizable Behavior
Read too late:
• A transaction U that started after transaction T, but wrote a value for X before T reads X.
Timeline: T starts → U starts → U writes X → T reads X
Physically Unrealizable Behavior
Write too late:
• A transaction U that started after T, but read X before T got a chance to write X.
Timeline: T starts → U starts → U reads X → T writes X
Figure: transaction T tries to write too late
Dirty Read
• It is possible that after T reads the value of X written by U, transaction U will abort.
Timeline: U starts → T starts → U writes X → T reads X → U aborts
T performs a dirty read if it reads X at the point shown.
Rules for Timestamp-Based Scheduling
In response to a request, the scheduler has three options:
1. Granting the request
2. Aborting T (if T would violate physical reality) and restarting T with a new timestamp (rollback)
3. Delaying T and later deciding whether to abort T or to grant the request

The rules:
1. The scheduler receives a read request rT(X):
a) If TS(T) ≥ WT(X), the read is physically realizable.
   1. If C(X) is true, grant the request; if TS(T) > RT(X), set RT(X) := TS(T); otherwise do not change RT(X).
   2. If C(X) is false, delay T until C(X) becomes true or the transaction that wrote X aborts.
b) If TS(T) < WT(X), the read is physically unrealizable. Rollback T.
Rules for Timestamp-Based Scheduling (cont.)
2. The scheduler receives a write request wT(X):
a) If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), the write is physically realizable and must be performed.
   1. Write the new value for X,
   2. Set WT(X) := TS(T), and
   3. Set C(X) := false.
b) If TS(T) ≥ RT(X) but TS(T) < WT(X), the write is physically realizable, but there is already a later value in X.
   a. If C(X) is true, the previous writer of X is committed; ignore the write by T.
   b. If C(X) is false, we must delay T.
Rules for Timestamp-Based Scheduling (cont.)
3. The scheduler receives a request to commit T. It must find all the database elements X written by T and set C(X) := true. If any transactions are waiting for X to be committed, these transactions are allowed to proceed.
4. The scheduler receives a request to abort T, or decides to rollback T. Then any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
A minimal sketch of rules 1 and 2 appears below.
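Below is a compact Python sketch (mine, not the book's) of rules 1 and 2; X is modeled as a dict with RT, WT, and C fields, and the returned string names the scheduler's choice:

# Sketch of timestamp-based read/write rules for one element X.
def read(ts, X):
    if ts >= X['WT']:                  # physically realizable
        if X['C']:
            X['RT'] = max(X['RT'], ts)
            return 'grant'
        return 'delay'                 # wait for the writer to commit/abort
    return 'rollback'                  # read too late

def write(ts, X):
    if ts >= X['RT'] and ts >= X['WT']:
        X['WT'], X['C'] = ts, False    # perform the write
        return 'grant'
    if ts >= X['RT']:                  # a later value is already in X
        return 'ignore' if X['C'] else 'delay'
    return 'rollback'                  # write too late

X = {'RT': 0, 'WT': 0, 'C': True}
assert read(5, X) == 'grant' and X['RT'] == 5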
Multiversion Timestamps
• Keeps old versions of data items to increase concurrency.
• Each successful write results in the creation of a new version of the data item written.
• Timestamps are used to label versions.
• When a read(X) operation is issued, select an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.
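A minimal sketch (mine) of version selection: versions of X are kept as (write-timestamp, value) pairs, and a reader with timestamp ts sees the latest version written at or before ts:

# Sketch: a multiversion read picks the newest version not after ts.
versions = [(0, 'v0'), (5, 'v5'), (9, 'v9')]   # (write timestamp, value)

def read_version(ts, versions):
    eligible = [(wt, v) for wt, v in versions if wt <= ts]
    return max(eligible)[1] if eligible else None

assert read_version(7, versions) == 'v5'   # a ts=7 reader sees the ts=5 version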
Timestamps and Locking
• Generally, timestamping performs better than locking in situations where:
  • Most transactions are read-only.
  • It is rare that concurrent transactions try to read and write the same element.
• In high-conflict situations, rollback will be frequent, introducing more delays than a locking system.
SECTION 18.9
VALIDATION-BASED SCHEDULING
By
Swathi Vegesna
217
At a Glance
 Introduction
 Validation based scheduling
 Validation based Scheduler
 Expected exceptions
 Validation rules
 Example
 Comparisons
 Summary
Introduction
• Optimistic concurrency control (like timestamp-based scheduling) allows transactions to access data without locks; at the appropriate time, we check that the transaction has behaved in a serializable manner.
• Validation-based scheduling differs in that the scheduler maintains a record of what the active transactions are doing, rather than keeping read and write times for all database elements.
Validation based scheduling
 Scheduler keeps a record of what the active transactions
are doing.
 Executes in 3 phases:
1. Read: read the elements in RS(T) from the database and compute the results in the transaction's local address space
2. Validate: compare the transaction's read and write sets with those of other transactions
3. Write: write the elements in WS(T) to the database
Validation based Scheduler
 Contains an assumed serial order of transactions.
 Maintains three sets:
 START( ): set of T’s started but not completed validation.
 VAL( ): set of T’s validated but not finished the writing
phase.
 FIN( ): set of T’s that have finished.
Expected exceptions
1. Suppose there is a transaction U such that:
• U is in VAL or FIN; that is, U has validated,
• FIN(U) > START(T); that is, U did not finish before T started,
• RS(T) ∩ WS(U) ≠ φ; let the intersection contain database element X.
2. Suppose there is a transaction U such that:
• U is in VAL; U has successfully validated,
• FIN(U) > VAL(T); U did not finish before T entered its validation phase,
• WS(T) ∩ WS(U) ≠ φ; let X be in both write sets.
Validation rules
 Check that RS(T) ∩ WS(U) = φ for any previously validated U that did not finish before T started, i.e., FIN(U) > START(T).
 Check that WS(T) ∩ WS(U) = φ for any previously validated U that did not finish before T validated, i.e., FIN(U) > VAL(T).
A minimal sketch of these checks appears after this list.
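A minimal Python sketch (mine; RS/WS are sets, and START/VAL/FIN are event times, with None meaning "not yet finished") of the two validation checks:

# Sketch of the validation rules for a transaction T against others.
def validate(T, others):
    for U in others:
        if U['FIN'] is None or U['FIN'] > T['START']:
            if T['RS'] & U['WS']:
                return False   # rule 1 violated: T may read U's writes too early
        if U['FIN'] is None or U['FIN'] > T['VAL']:
            if T['WS'] & U['WS']:
                return False   # rule 2 violated: out-of-order writes possible
    return True

U = {'RS': {'B'}, 'WS': {'D'}, 'START': 1, 'VAL': 2, 'FIN': 4}
T = {'RS': {'A', 'B'}, 'WS': {'A', 'C'}, 'START': 2, 'VAL': 3, 'FIN': None}
assert validate(T, [U])   # T validates, as in the example below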
Example
Solution
 Validation of U:
Nothing to check
 Validation of T:
WS(U) ∩ RS(T)= {D} ∩{A,B}=φ
WS(U) ∩ WS(T)= {D}∩ {A,C}=φ
 Validation of V:
RS(V) ∩ WS(T)= {B}∩{A,C}=φ
WS(V) ∩ WS(T)={D,E}∩ {A,C}=φ
RS(V) ∩ WS(U)={B} ∩{D}=φ
 Validation of W:
RS(W) ∩ WS(T) = {A,D} ∩ {A,C} = {A}
RS(W) ∩ WS(V) = {A,D} ∩ {D,E} = {D}
WS(W) ∩ WS(V) = {A,C} ∩ {D,E} = φ
(W is not validated)
Comparison

Mechanism  | Storage utilization                                   | Delays
-----------|-------------------------------------------------------|--------------------------------
Locks      | Space in the lock table is proportional to the number | Delays transactions but
           | of database elements locked.                          | avoids rollbacks.
Timestamps | Space is needed for read and write times for every    | Do not delay transactions, but
           | database element, whether or not it is currently      | cause rollbacks unless
           | accessed.                                             | interference is low.
Validation | Space is used for timestamps and read/write sets for  | Do not delay transactions, but
           | each currently active transaction, plus a few more    | cause rollbacks unless
           | transactions that finished after some currently       | interference is low.
           | active transaction began.                             |
21.1 Introduction to
Information Integration
CS257 Fan Yang
Need for Information Integration


Ideally, all the data in the world could be put in a single database (an ideal database system).
In the real world this is impossible for a single database: databases are created independently, and it is difficult to design a database to support future uses, due to the evolving nature of information.
University Database

Registrar: to record student and grade

Bursar: to record tuition payments by students


Human Resources Department: to record
employees
Other department….
Inconvenient


Record grades for students who pay tuition
Want to swim in SJSU aquatic center for free in
summer vacation?
(none of the cases above can be handled while the data sits in
separate, independent databases)

Solution: one database
How to integrate

Start over
build one database: contains all the legacy
databases; rewrite all the applications
result: painful and it is impractical

Build a layer of abstraction (middleware)
on top of all the legacy databases
this layer is often defined by a collection of classes
BUT…
Heterogeneity Problem

What is Heterogeneity Problem
Aardvark Automobile Co.
1000 dealers has 1000 databases
to find a model at another dealer
can we use this command:
SELECT * FROM CARS
WHERE MODEL=“A6”;
Type of Heterogeneity

Communication Heterogeneity

Query-Language Heterogeneity

Schema Heterogeneity

Data type difference

Value Heterogeneity

Semantic Heterogeneity
Conclusion

One database system is perfect, but impossible

Independent database is inconvenient

Integrate database
1. start over
2. middleware

heterogeneity problem
Chapter 21.2
Modes of Information Integration
ID: 219
Name: Qun Yu
Class: CS257 219 Spring 2009
Instructor: Dr. T.Y.Lin
Content Index
21.2 Modes of Information Integration
21.2.1 Federated Database Systems
21.2.2 Data Warehouses
21.2.3 Mediators
Federations
 The simplest architecture for integrating
several DBs
 One to one connections between all pairs of
DBs
 n DBs talk to each other, n(n-1) wrappers are
needed
 Good when communications between DBs are
limited
Wrapper

Wrapper: software that translates incoming queries and outgoing answers. As a result, it allows information sources to conform to some shared schema.
Federations Diagram
Figure: DB1, DB2, DB3, and DB4, with 2 wrappers (one per direction) between each pair of DBs.
A federated collection of 4 DBs needs 12 components to translate queries from one to another.
Example
Car dealers want to share their inventory. Each dealer queries the
other’s DB to find the needed car.
Dealer-1's DB relation:  NeededCars(model, color, autoTrans)
Dealer-2's DB relations: Autos(serial, model, color)
                         Options(serial, option)
wrapper
Dealer-1’s DB
wrapper
Dealer-2’s DB
Example…
For (each tuple (:m, :c, :a) in NeededCars) {
  IF (:a = TRUE) { /* automatic transmission wanted */
    SELECT serial
    FROM Autos, Options
    WHERE Autos.serial = Options.serial AND Options.option = 'autoTrans'
      AND Autos.model = :m AND Autos.color = :c;
  }
  ELSE { /* automatic transmission not wanted */
    SELECT serial
    FROM Autos
    WHERE Autos.model = :m AND Autos.color = :c AND
      NOT EXISTS (SELECT * FROM Options
                  WHERE serial = Autos.serial AND option = 'autoTrans');
  }
}
Dealer 1 queries Dealer 2 for needed cars
Data Warehouse



Sources are translated from their local schema to a
global schema and copied to a central DB.
User transparent: user uses Data Warehouse just
like an ordinary DB
User is not allowed to update Data Warehouse
Warehouse Diagram
Figure: the user sends a query to the warehouse and receives a result; a combiner loads the warehouse with data produced by one extractor per source (Source 1, Source 2).
Example
Construct a data warehouse from sources DB of 2 car dealers:
Dealer-1’s schema: Cars(serialNo, model,color,autoTrans,cdPlayer,…)
Dealer-2’s schema: Auto(serial,model,color)
Options(serial,option)
Warehouse's schema:
AutosWhse(serialNo, model, color, autoTrans, dealer)
Extractor --- Query to extract data from Dealer-1's data:
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars;
Example
Extractor --- Query to extract data from Dealer-2's data:
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serial, model, color, 'yes', 'dealer2'
FROM Autos, Options
WHERE Autos.serial = Options.serial AND option = 'autoTrans';

INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serial, model, color, 'no', 'dealer2'
FROM Autos
WHERE NOT EXISTS (SELECT * FROM Options
                  WHERE serial = Autos.serial AND option = 'autoTrans');
Construct Data Warehouse
There are mainly three ways to construct
the data in the warehouse:
1)
Periodically reconstructed from the current data in the sources,
once a night or at even longer intervals.
Advantages:
simple algorithms.
Disadvantages:
1) need to shut down the warehouse;
2) data can become out of date.
Construct Data Warehouse
2)
Updated periodically based on the changes(i.e. each night) of
the sources.
Advantages:
involve smaller amounts of data. (important when warehouse is large and
needs to be modified in a short period)
Disadvantages:
1) the process to calculate changes to the warehouse is complex.
2) data can become out of date.
Construct Data Warehouse
3)
Changed immediately, in response to each change or a small set
of changes at one or more of the sources.
Advantages:
data won’t become out of date.
Disadvantages:
requires too much communication, therefore, it is generally too
expensive.
(practical for warehouses whose underlying sources changes slowly.)
Mediators

Virtual warehouse, which supports a virtual view or a
collection of views, that integrates several sources.

Mediator doesn’t store any data.

Mediators’ tasks:
1)receive user’s query,
2)send queries to wrappers,
3)combine results from wrappers,
4)send the final result to user.
A Mediator Diagram
Figure: the user query goes to the mediator and the result comes back; the mediator sends queries to a wrapper over each source (Source 1, Source 2) and collects the results.
Example
Using the same data sources as in the data-warehouse example, the mediator integrates the two dealers' sources into a view with schema:
AutosMed(serialNo, model, color, autoTrans, dealer)
When the user issues the query:
SELECT serialNo, model
FROM AutosMed
WHERE color = 'red';
Example
In this simple case, the mediator forwards the same query to each of the two wrappers.
Wrapper 1: Cars(serialNo, model, color, autoTrans, cdPlayer, …)
SELECT serialNo, model
FROM Cars
WHERE color = 'red';
Wrapper 2: Autos(serial, model, color); Options(serial, option)
SELECT serial, model
FROM Autos
WHERE color = 'red';
The mediator then interprets serial as serialNo and returns the union of these sets of data to the user.
Example
There may be different options for how the mediator forwards the user query. For example, the user asks whether there is a car of a specific model and color (e.g., "Gobi", "blue").
The mediator decides whether the 2nd query is needed based on the result of the 1st query: if dealer 1 has the specific car, the mediator doesn't have to query dealer 2.
Chapter 21
Information Integration
21.3 Wrappers in Mediator-Based Systems
Presented by: Kai Zhu
Professor: Dr. T.Y. Lin
Class ID: 220
Introduction

Templates for Query patterns

Wrapper Generator

Filter
Wrappers in Mediator-based Systems
• More complicated than those in most data-warehouse systems.
• Must be able to accept a variety of queries from the mediator and translate them into the terms of the source.
• Must communicate the result to the mediator.
How do we design a wrapper?
Classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants.
Templates for Query Patterns:

Use notation T=>S to express the idea that the
template T is turned by the wrapper into the source
query S.

Example 1
Dealer 1
Cars (serialNo, model, color, autoTrans, navi,…)
For use by a mediator with schema
AutosMed(serialNo, model, color, autoTrans, dealer)

We denote the code representing that color by the parameter $c; then the template will be:
SELECT *
FROM AutosMed
WHERE color = '$c';
=>
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars
WHERE color = '$c';
(Template T => Source query S)
There will be 2^n templates in total if we have the option of specifying n attributes.
Wrapper Generators
• The wrapper generator creates a table that holds the various query patterns contained in the templates, with the associated source queries.
• A driver is used in each wrapper; the task of the driver is to:
  • Accept a query from the mediator.
  • Search the table for a template that matches the query.
  • Send the source query to the source, again using a "plug-in" communication mechanism.
  • Process the response and pass it to the mediator.
A minimal sketch of such a driver appears below.
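As a rough illustration (the names and structure are mine, not the book's), a driver can key its template table by the set of attributes the query binds:

# Sketch: a wrapper driver matching mediator queries to templates.
TEMPLATES = {
    frozenset(['color']):
        "SELECT serialNo, model, color, autoTrans, 'dealer1' "
        "FROM Cars WHERE color = '{color}'",
    frozenset(['model', 'color']):
        "SELECT serialNo, model, color, autoTrans, 'dealer1' "
        "FROM Cars WHERE model = '{model}' AND color = '{color}'",
}

def translate(bindings):
    """Find a template matching the bound attributes; fill in parameters."""
    key = frozenset(bindings)
    if key not in TEMPLATES:
        raise LookupError('no template matches this query pattern')
    return TEMPLATES[key].format(**bindings)

print(translate({'color': 'blue'}))   # uses the color-only template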
Filter

A wrapper can use a filter to support more queries than its templates directly cover.


Example 2
If wrapper is designed with more complicated
template with queries specify both model and color.
SELECT *
FROM AutosMed
WHERE model = ’$m’ AND color = ’$c’;
=>
SELECT serialNo, model, color, autoTrans, ’dealer1’
FROM Cars
WHERE model = ’$m’ AND color=’$c’;

Now suppose the only template we have is for color, but the wrapper is asked by the mediator to find "blue Gobi model cars."
Solution:
1. Use the template with $c = 'blue' to find all blue cars and store them in a temporary relation:
   TemAutos(serialNo, model, color, autoTrans, dealer)
2. The wrapper then returns to the mediator the desired set of automobiles by executing the local query:
   SELECT *
   FROM TemAutos
   WHERE model = 'Gobi';
INFORMATION
INTEGRATION
SECTIONS 21.4
Sanuja Dabade & Eilbroun Benjamin
CS 257 – Dr. TY Lin
Presentation Outline
21.4 Capability Based
Optimization
21.4.1 The Problem of Limited Source
Capabilities
21.4.2 A notation for Describing
Source Capabilities
21.4.3 Capability-Based Query-Plan
Selection
21.4.4 Adding Cost-Based
Optimization
21.5 Optimizing Mediator
Queries
21.4 Capability Based
Optimization
• Introduction
– Typical DBMS estimates the cost of each
query plan and picks what it believes to be the
best
– A mediator, by contrast, typically has no knowledge of how long its sources will take to answer
– Optimization of mediator queries therefore cannot rely on a cost measure alone to select a query plan
– Optimization by mediator follows capability
based optimization
21.4.1 The Problem of Limited
Source Capabilities
• Many sources have only Web Based
interfaces
• Web sources usually allow querying
through a query form
• E.g. amazon.com interface allows us to
query about books in many different ways.
• But we cannot ask questions that are too
general
– E.g. Select * from books;
21.4.1 The Problem of Limited
Source
Capabilities (con’t)
• Reasons why a source may limit the ways in which queries can be asked:
– The earliest databases did not use relational DBMSs that support SQL queries
– Indexes on a large database may make certain queries feasible, while others are too expensive to execute
– Security reasons
  • E.g., a medical database may answer queries about averages, but won't disclose the details of a particular patient
21.4.2 A Notation for Describing
Source Capabilities
• For relational data, the legal forms of
queries are described by adornments
• Adornments – Sequences of codes that
represent the requirements for the attributes
of the relation, in their standard order
– f (free): a value may be specified or not
– b (bound): a value must be specified, but any value is allowed
– u (unspecified): it is not permitted to specify a value for the attribute
21.4.2 A Notation for Describing Source Capabilities (cont'd)
– c[S] (choice from set S): a value must be specified, and the value must come from the finite set S
– o[S] (optional from set S): either do not specify a value, or specify a value from the finite set S
– A prime (e.g., f') indicates that the attribute is not part of the output of the query
• A capabilities specification is a set of adornments
21.4.2 A notation for Describing
Source Capabilities….(cont’d)
• E.g. Dealer 1 is a source of data in the
form:
Cars (serialNo, model, color, autoTrans, navi)
The adornment for this query form is b’uuuu
21.4.3 Capability-Based Query-Plan Selection
• Given a query at the mediator, a capability
based query optimizer first considers what
queries it can ask at the sources to help
answer the query
• The process is repeated until:
– Enough queries are asked at the sources to
resolve all the conditions of the mediator query
and therefore query is answered. Such a plan is
called feasible.
21.4.3 Capability-Based Query-Plan Selection (cont'd)
• The simplest form of mediator query using
above strategy is join relations
• E.g., we have sources for dealer 2:
– Autos(serial, model, color)
– Options(serial, option)
• Suppose that ubf is the sole adornment for Autos, and Options has two adornments, bu and uc[autoTrans, navi]
• Query: find the serial numbers and colors of Gobi models with a navigation system
21.4.4 Adding Cost-Based
Optimization
• The mediator's query optimizer is not done when the capabilities of the sources have been examined
• Having found the feasible plans, it must choose among them
• Making an intelligent, cost-based query optimization requires that the mediator know a great deal about the costs of the queries involved
INFORMATION
INTEGRATION
SECTION 21.5
Eilbroun Benjamin
CS 257 – Dr. TY Lin
Presentation Outline
21.5 Optimizing Mediator
Queries
21.5.1 Simplified Adornment
Notation
21.5.2 Obtaining Answers for
Subgoals
21.5.3 The Chain Algorithm
21.5.4 Incorporating Union Views
at the Mediator
21.5 Optimizing Mediator Queries
• Chain algorithm – a greedy algorithm
which finds answer by sending a sequence
of requests to its sources.
– Will always find a solution assuming at least
one solution exists.
– The solution may not be optimal.
21.5.1 Simplified Adornment
Notation
• A query at the mediator is limited to b
(bound) and f (free) adornments.
• We use the following convention for
describing adornments:
– name^adornments(attributes)
– where:
• name is the name of the relation
• the number of adornments = the number of
attributes
21.5.2 Obtaining Answers for
Subgoals
• Rules for subgoals and sources:
– Suppose we have the subgoal R^x1x2…xn(a1, a2, …, an), and the source adornments for R are y1y2…yn:
  • If yi is b or c[S], then xi must be b.
  • If xi is f, then yi must not be output-restricted.
– The adornment on the subgoal matches the adornment at the source if, whenever yi is f, u, or o[S], xi is either b or f.
21.5.3 The Chain Algorithm
• Maintains 2 types of information:
– An adornment for each subgoal.
– A relation X that is the join of the relations for
all the subgoals that have been resolved.
• Initially, the adornment for a subgoal argument is b if the mediator query provides a constant binding for the corresponding argument of that subgoal.
• Initially, X is a relation over no attributes, containing just the empty tuple.
21.5.3 The Chain Algorithm
(con’t)
• First, initialize adornments of subgoals
and X.
• Then, repeatedly select a subgoal that can be resolved. Let R^α(a1, a2, …, an) be that subgoal:
1. Wherever α has a b, the corresponding argument of R is either a constant or a variable in the schema of X.
– Project X onto its variables that appear in R.
21.5.3 The Chain Algorithm
(con’t)
2. For each tuple t in the projection of X, issue a query to the source as follows (β is a source adornment for R):
– If a component of β is b, then the corresponding component of α is b, and we can use the corresponding component of t in the source query.
– If a component of β is c[S], and the corresponding component of t is in S, then the corresponding component of α is b, and again we can use the component of t.
21.5.3 The Chain Algorithm
(con’t)
– If a component of β is u, provide no binding for this component in the source query.
– If a component of β is o[S], and the corresponding component of α is f, treat it as if it were f.
– If a component of β is o[S], and the corresponding component of α is b, treat it as if it were c[S].
3. Every variable among a1, a2, …, an is now bound.
4. Replace X with X ⋈ πS(R), where S is all of the variables among a1, a2, …, an.
5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal.
• If every subgoal is resolved, then X is the answer.
• If some subgoal cannot be resolved, the algorithm fails.
A simplified sketch of the algorithm follows.
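A much-simplified Python sketch (mine; only b/f adornments, every source accepts any binding pattern, and relations are in-memory lists of tuples) of the chain of resolutions, applied to the example that follows:

# Sketch: resolve subgoals left to right, joining results into X.
def chain(subgoals, head_vars):
    """subgoals: list of (relation, args); args are variable names or constants."""
    X = [{}]                                   # start with one empty tuple
    for rel, args in subgoals:
        new_X = []
        for t in X:                            # each binding found so far
            for row in rel:                    # probe the source relation
                if all(t.get(a, row[i]) == row[i] if isinstance(a, str)
                       else a == row[i]        # a constant must match exactly
                       for i, a in enumerate(args)):
                    t2 = dict(t)
                    t2.update((a, row[i]) for i, a in enumerate(args)
                              if isinstance(a, str))
                    new_X.append(t2)
        X = new_X
    return {tuple(t[v] for v in head_vars) for t in X}

R = [(1, 2), (1, 3), (1, 4)]
S = [(2, 4), (3, 5)]
T = [(4, 6), (5, 7), (5, 8)]
# Q(c) <- R(1,a) AND S(a,b) AND T(b,c)
print(chain([(R, [1, 'a']), (S, ['a', 'b']), (T, ['b', 'c'])], ['c']))
# prints {(6,), (7,), (8,)}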
21.5.3 The Chain Algorithm Example
• Mediator query:
– Q: Answer(c) ← R^bf(1,a) AND S^ff(a,b) AND T^ff(b,c)
• Example data:
  R(w,x): (1,2), (1,3), (1,4)
  S(x,y): (2,4), (3,5)
  T(y,z): (4,6), (5,7), (5,8)
21.5.3 The Chain Algorithm Example
(con’t)
• Initially, the adornments on the subgoals are the same as in Q, and X contains an empty tuple.
– S and T cannot be resolved, because their adornments are ff while the sources require either a b or a c[S] binding.
• R(1,a) can be resolved, because its adornments are matched by the source's adornments.
21.5.3 The Chain Algorithm Example
(con’t)
• Project the subgoal's relation onto its second component, since only the second component of R(1,a) is a variable:
  a: 2, 3, 4
• This is joined with X, resulting in the new X(a) = {2, 3, 4}.
21.5.3 The Chain Algorithm Example
(con’t)
• Now we resolve S^bf(a,b):
– Project X onto a, and search S for tuples whose first component equals a value of a in X:
  (a, b): (2, 4), (3, 5)
• Join this relation with X, and remove a because it appears neither in the head nor in any unresolved subgoal:
  b: 4, 5
21.5.3 The Chain Algorithm Example
(con’t)
• Now we resolve T^bf(b,c):
  (b, c): (4, 6), (5, 7), (5, 8)
• Join this relation with X and project onto the c attribute to get the relation for the head.
• The solution is {(6), (7), (8)}.
21.5.4 Incorporating Union Views at
the Mediator
• This implementation of the Chain
Algorithm does not consider that several
sources can contribute tuples to a relation.
• If specific sources have tuples to
contribute that other sources may not
have, it adds complexity.
• To resolve this, we can consult all sources,
or make best efforts to return all the
answers.
21.5.4 Incorporating Union Views at
the Mediator (con’t)
• Consulting All Sources
– We can only resolve a subgoal when every source for its relation has an adornment matched by the current adornment of the subgoal.
– Less practical, because it makes queries harder to answer and impossible when a source is down.
• Best Efforts
– We need only one source with a matching adornment to resolve the subgoal, accepting that the answer may be incomplete.
LOCAL-AS-VIEW
MEDIATORS
Priya Gangaraju(Class Id:203)
Local-as-View Mediators.
• In a LAV mediator, the global predicates are not defined as views of the source data.
• Instead, expressions are defined, involving the global predicates, that describe the tuples each source is able to produce.
• Queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.
Motivation for LAV Mediators
• Relationship between what the mediator
should provide and what the sources
provide is more subtle.
• Example, consider the predicate Par(c, p)
meaning that p is a parent of c which
represents the set of all child parent facts
that could ever exist.
• The sources will provide information about
whatever child-parent facts they know.
Motivation(contd..)
• There can be sources which may provide
child-grandparent facts but not child- parent
facts at all.
• This source can never be used to answer the
child-parent query under GAV mediators.
• LAV mediators allow to say that a certain
source provides grand parent facts.
• They help discover how and when to use
that source in a given query.
Terminology for LAV Mediation.
• The queries at the mediator and that
describe the source will be single Datalog
rules.
• Single Datalog rule is often called a
conjunctive query.
• The global predicates of the LAV mediator
are used as the sub-goals of mediator
queries.
• Conjunctive queries define views.
Contd..
• Their heads each have a unique view
predicate that is the name of a view.
• Each view definition has a body consisting
of global predicates and is associated with a
particular source.
• Assumption - each view can be constructed
with an all-free adornment.
Example..
• Consider global predicate Par(c, p) meaning
that p is a parent of c.
• One source produces parent facts. Its view is defined by the conjunctive query:
  V1(c, p) ← Par(c, p)
• Another source produces some grandparent facts. Its conjunctive query is:
  V2(c, g) ← Par(c, p) AND Par(p, g)
Example contd..
• The query at the mediator asks for great-grandparent facts that can be obtained from the sources. The mediator query is:
  Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)
• One solution uses the parent predicate (V1) directly three times.
• Another solution uses V1 (parent facts) and V2 (grandparent facts):
  Q(w, z) ← V1(w, x) AND V2(x, z)
  or
  Q(w, z) ← V2(w, y) AND V1(y, z)
Expanding Solutions.
• Consider a query Q, a solution S that has a
body whose subgoals are views and each
view V is defined by a conjunctive query
with that view as the head.
• The body of V’s conjunctive query can be
substituted for a subgoal in S that uses the
predicate V to have a body consisting of
only global predicates.
Expansion Algorithm
• A solution S has a subgoal V(a1, a2,…an)
where ai’s can be any variables or constants.
• The view V can be of the form
V(b1, b2,….bn)  B
Where B represents the entire body.
• V(a1, a2, … an) can be replaced in solution
S by a version of body B that has all the
subgoals of B with variables possibly
altered.
Expansion Algorithm contd..
The rules for altering the variables of B are:
1. First identify the local variables of B: variables that appear in the body but not in the head.
2. If any local variable of B appears in S, or has been used as a local variable in another substitution, replace each one by a distinct new variable that appears nowhere in the rule for V or in S.
Example.
• Consider the view definitions:
  V1(c, p) ← Par(c, p)
  V2(c, g) ← Par(c, p) AND Par(p, g)
• One of the proposed solutions S is:
  Q(w, z) ← V1(w, x) AND V2(x, z)
• The first subgoal, with predicate V1, expands to Par(w, x), as there are no local variables.
• The V2 subgoal has a local variable p, which doesn't appear in S and has not been used as a local variable in another substitution, so p can be left as it is. Only x and z are substituted for the variables c and g.
• The solution S now becomes:
  Q(w, z) ← Par(w, x) AND Par(x, p) AND Par(p, z)
Containment of Conjunctive
Queries
A containment mapping from Q to E is a function τ from the variables of Q to the variables and constants of E, such that:
1. If x is the ith argument of the head of Q, then τ(x) is the ith argument of the head of E.
2. Add to τ the rule that τ(c) = c for any constant c. If P(x1, x2, …, xn) is a subgoal of Q, then P(τ(x1), τ(x2), …, τ(xn)) is a subgoal of E.
Example.
• Consider two conjunctive queries:
  Q1: H(x, y) ← A(x, z) AND B(z, y)
  Q2: H(a, b) ← A(a, c) AND B(d, b) AND A(a, d)
• When we apply the substitution τ(x) = a, τ(y) = b, τ(z) = d, the head of Q1 becomes H(a, b), which is the head of Q2. So there is a containment mapping from Q1 to Q2.
Example contd..
• The first subgoal of Q1 becomes A(a, d), which is the third subgoal of Q2.
• The second subgoal of Q1 becomes B(d, b), the second subgoal of Q2.
• There is also a containment mapping from Q2 to Q1, so the two conjunctive queries are equivalent.
Why the Containment-Mapping
Test Works
• Suppose there is a containment mapping τ from Q1 to Q2.
• When Q2 is applied to the database, we look for substitutions σ for all the variables of Q2.
• The substitution for the head becomes a tuple t that is returned by Q2.
• If we compose τ and then σ, we have a mapping from the variables of Q1 to tuples of the database that produces the same tuple t; thus t is also returned by Q1.
Finding Solutions to a Mediator
Query
• There can be an infinite number of solutions built from the views, using any number of subgoals and variables.
• The LMSS Theorem limits the search:
– If a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals.
• If the conjunctive query that defines a view V has in its body a predicate P that doesn't appear in the body of the mediator query, then we need not consider any solution that uses V.
Example.
• Recall the query:
  Q1: Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)
• This query has three subgoals, so we don’t
have to look at solutions with more than
three subgoals.
Why the LMSS Theorem Holds
• Suppose we have a query Q with n
subgoals and there is a solution S with
more than n subgoals.
• The expansion E of S must be contained in
Query Q, which means that there is a
containment mapping from Q to E.
• We remove from S all subgoals whose
expansion was not the target of one of Q’s
subgoals under the containment mapping.
Contd..
• We would have a new conjunctive query S' with at most n subgoals.
• If E' is the expansion of S', then E' is contained in Q.
• S is contained in S', as there is an identity mapping.
• Thus S need not be among the solutions to query Q.
Information Integration
Entity Resolution – 21.7
Presented By:
Deepti Bhardwaj
Roll No: 223_103
Contents

21.7 Entity Resolution
21.7.1 Deciding Whether Records Represent a
Common Entity
 21.7.2 Merging Similar Records
 21.7.3 Useful Properties of Similarity and Merge
Functions
 21.7.4 The R-Swoosh Algorithm for ICAR Records
 21.7.5 Other Approaches to Entity Resolution

Introduction

Determining whether two records or tuples do or do
not represent the same person, organization, place or
other entity is called ENTITY RESOLUTION.
Deciding whether Records represent
a Common Entity


Two records represent the same individual if the two
records have similar values for each of the fields
associated with those records.
It may be possible that two records are same but
corresponding fields are not identical because of
following reasons:
1. Misspellings
2. Variant Names
3. Misunderstanding of Names
Continue: Deciding whether Records
represent a Common Entity
4. Evolution of Values
5. Abbreviations
Thus, when deciding whether two records represent the same
entity, we need to look carefully at the kinds of discrepancies
and use a test that measures the similarity of records.
Deciding Whether Records
Represents a Common Entity - Edit
Distance



A first approach to measuring the similarity of records is edit distance.
Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other.
The records represent the same entity if their distance is below a given threshold. A minimal sketch of this measure follows.
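A minimal Python sketch (mine) of the insert/delete-only edit distance just described (substitution is not counted as a single operation):

# Sketch: edit distance counting only insertions and deletions.
def edit_distance(x, y):
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of x's prefix
    for j in range(n + 1):
        d[0][j] = j                      # insert all of y's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1])
    return d[m][n]

assert edit_distance('Smith', 'Smythe') == 3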
Deciding Whether Records
Represents a Common Entity Normalization


We can normalize records by replacing certain substrings with others; for instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for.
Once normalized, we can use edit distance to measure the difference between the normalized values in the fields.
Merging Similar Records


Merging means replacing two similar records with a single record that contains the information of both.
There are many possible merge rules:
1. Set any field on which the records disagree to the empty string.
2. (i) Merge by taking the union of the values in each field.
   (ii) Declare two records similar if at least two of the three fields have a nonempty intersection.
Continue: Merging Similar
Records
Name           | Address                       | Phone
1. Susan       | 123 Oak St.                   | 818-555-1234
2. Susan       | 456 Maple St.                 | 818-555-1234
3. Susan       | 456 Maple St.                 | 213-555-5678

After merging:

Name           | Address                       | Phone
(1-2-3) Susan  | {123 Oak St., 456 Maple St.}  | {818-555-1234, 213-555-5678}
Properties of Similarity and Merge
Functions
The following properties say that the merge operation is a semilattice:
1. Idempotence: the merge of a record with itself should surely be that record.
2. Commutativity: the order of merging should not matter, and the end result should be the same.
3. Associativity: the order in which we group records for a merger should not matter.
Continue: Useful Properties of Similarity
and Merge Functions
There are some other properties that we expect similarity
relationship to have:



Idempotence for similarity : A record is always similar
to itself
Commutativity of similarity : In deciding whether two
records are similar it does not matter in which order we
list them
Representability: if record r is similar to record s, and s is
merged with some other record t, then r remains similar to the
merger of s and t and can be merged with that record.
R-swoosh Algorithm for ICAR
Records

Input: A set of records I, similarity function and a merge function.

Output: A set of merged records O.

Method:


O := emptyset;
WHILE I is not empty DO BEGIN
    Let r be any record in I;
    Find, if possible, some record s in O that is similar to r;
    IF no such record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;
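A direct Python transcription (mine) of the pseudocode above, with a toy similarity/merge pair for illustration:

# Sketch: R-Swoosh; toy records are sets, similar if they intersect.
def r_swoosh(I, similar, merge):
    O = []
    I = list(I)
    while I:
        r = I.pop()                                  # any record in I
        s = next((x for x in O if similar(r, x)), None)
        if s is None:
            O.append(r)                              # move r from I to O
        else:
            O.remove(s)                              # delete s from O
            I.append(merge(r, s))                    # add the merger to I
    return O

records = [{'818-555-1234', '123 Oak St.'},
           {'818-555-1234', '456 Maple St.'},
           {'213-555-5678', '456 Maple St.'}]
print(r_swoosh(records, lambda a, b: bool(a & b), lambda a, b: a | b))
# prints one merged record containing all four values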
Other Approaches to Entity
Resolution
The other approaches to entity resolution are :

Non- ICAR Datasets

Clustering

Partitioning
Other Approaches to Entity
Resolution - Non ICAR Datasets
Non ICAR Datasets : We can define a dominance
relation r<=s that means record s contains all the
information contained in record r.
If so, then we can eliminate record r from further
consideration.
Other Approaches to Entity
Resolution - Clustering
Clustering: sometimes we group the records into clusters such that
members of a cluster are in some sense similar to each other, and
members of different clusters are not.
Other Approaches to Entity
Resolution - Partitioning
Partitioning: We can group the records, perhaps several
times, into groups that are likely to contain similar records
and look only within each group for pairs of similar records.
The Query Compiler
16.1 Parsing and Preprocessing
Meghna Jain(205)
Dr. T. Y. Lin
Presentation Outline
16.1 Parsing and Preprocessing
16.1.1 Syntax Analysis and Parse Tree
16.1.2 A Grammar for Simple Subset of SQL
16.1.3 The Preprocessor
16.1.4 Processing Queries Involving Views
Query compilation is divided into three steps:
1. Parsing: parse the SQL query into a parse tree.
2. Logical query plan: transform the parse tree into an expression tree of relational algebra.
3. Physical query plan: transform the logical query plan into a physical query plan, which specifies:
   • the operations performed,
   • the order of the operations,
   • the algorithm used for each operation,
   • the way in which stored data is obtained and passed from one operation to another.

Figure (from a query to a logical query plan):
Query → Parser → Preprocessor → Logical query plan generator → Query rewrite → Preferred logical query plan
Syntax Analysis and Parse Tree
The parser takes the SQL query and converts it to a parse tree. Nodes of the parse tree:
1. Atoms: lexical elements such as keywords, constants, parentheses, operators, and other schema elements.
2. Syntactic categories: subparts that play a similar role in a query, such as <Query> and <Condition>.
Grammar for Simple Subset of SQL
<Query> ::= <SFW>
<Query> ::= (<Query>)
<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
<SelList> ::= <Attribute>,<SelList>
<SelList> ::= <Attribute>
<FromList> ::= <Relation>, <FromList>
<FromList> ::= <Relation>
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
<Tuple> ::= <Attribute>
Legend: atoms (constants); <syntactic categories> (variables); ::= means "can be defined as".
A hypothetical parser sketch for part of this grammar follows.
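To show how such a grammar drives parsing, here is a small, hypothetical recursive-descent parser in Python for a fragment of it (only '=' conditions joined by AND; LIKE, patterns, and nested queries are omitted):

# Sketch: recursive-descent parsing of SELECT-FROM-WHERE.
import re

def tokenize(sql):
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*|=|,|;", sql)

def parse_sfw(toks):
    """<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>"""
    assert toks.pop(0).upper() == 'SELECT'
    sel = parse_list(toks, stop='FROM')      # <SelList>
    frm = parse_list(toks, stop='WHERE')     # <FromList>
    return ('SFW', sel, frm, parse_condition(toks))

def parse_list(toks, stop):
    items = [toks.pop(0)]
    while toks[0] == ',':
        toks.pop(0)
        items.append(toks.pop(0))
    assert toks.pop(0).upper() == stop
    return items

def parse_condition(toks):
    left, _, right = toks.pop(0), toks.pop(0), toks.pop(0)   # attr = attr
    node = ('=', left, right)
    if toks and toks[0].upper() == 'AND':
        toks.pop(0)
        return ('AND', node, parse_condition(toks))
    return node

print(parse_sfw(tokenize(
    "SELECT title FROM StarsIn, MovieStar WHERE starName = name")))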
Query and Parse Tree
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query:
Give titles of movies that have at least one star born in 1960
SELECT title FROM StarsIn WHERE starName IN
(
SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%'
);
Another query equivalent
SELECT title
FROM StarsIn, MovieStar
WHERE starName = name AND
birthdate LIKE '%1960%' ;
Parse Tree

<Query>
  <SFW>
    SELECT <SelList>
      <Attribute> title
    FROM <FromList>
      <RelName> StarsIn , <FromList>
        <RelName> MovieStar
    WHERE <Condition>
      <Condition>: <Attribute> starName = <Attribute> name
      AND
      <Condition>: <Attribute> birthdate LIKE <Pattern> '%1960'
The Preprocessor
Functions of the preprocessor:
• If a relation used in the query is a virtual view, then each use of this relation in the from-list must be replaced by a parse tree that describes the view.
• It is also responsible for semantic checking:
1. Checks relation uses: every relation mentioned in a FROM clause must be a relation or a view in the current schema.
2. Checks and resolves attribute uses: every attribute mentioned in a SELECT or WHERE clause must be an attribute of some relation in the current scope.
3. Checks types: all attributes must be of a type appropriate to their uses.
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query:
Give titles of movies that have at least one star born in 1960
SELECT title FROM StarsIn WHERE starName IN
(
SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%'
);
Preprocessing Queries Involving Views
When an operand in a query is a virtual view, the preprocessor needs to replace the operand by a piece of parse tree that represents how the view is constructed from the base table.
Base table: Movies(title, year, length, genre, studioName, producerC#)
View definition:
CREATE VIEW ParamountMovies AS
SELECT title, year FROM Movies
WHERE studioName = 'Paramount';
Example query based on the view:
SELECT title FROM ParamountMovies WHERE year = 1979;
16.2 ALGEBRAIC LAWS FOR
IMPROVING QUERY PLANS
Ramya Karri
ID: 206
Optimizing the Logical Query Plan

The translation rules converting a parse tree to a logical
query tree do not always produce the best logical query
tree.

It is often possible to optimize the logical query tree by
applying relational algebra laws to convert the original
tree into a more efficient logical query tree.

Optimizing a logical query tree using relational algebra
laws is called heuristic optimization
Relational Algebra Laws
These laws often involve the properties of:

commutativity - operator can be applied to operands
independent of order.


E.g. A + B = B + A - The “+” operator is commutative.
associativity - operator is independent of operand
grouping.

E.g. A + (B + C) = (A + B) + C - The “+” operator is
associative.
Associative and Commutative
Operators

The relational-algebra operators of cross-product (×), join (⋈), union (∪), and intersection (∩) are all associative and commutative.

  Commutative     | Associative
  R × S = S × R   | (R × S) × T = R × (S × T)
  R ⋈ S = S ⋈ R   | (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
  R ∪ S = S ∪ R   | (R ∪ S) ∪ T = R ∪ (S ∪ T)
  R ∩ S = S ∩ R   | (R ∩ S) ∩ T = R ∩ (S ∩ T)
Laws Involving Selection




Selections reduce the size of a relation.
To make the query efficient, the selection must be moved down the tree without changing what the expression does.
Complex selections involving AND or OR can be broken into two or more selections (splitting laws):
σC1 AND C2 (R) = σC1(σC2(R))
σC1 OR C2 (R) = σC1(R) ∪S σC2(R)   (∪S denotes set union)
Example






R = {a,a,b,b,b,c}
p1 satisfied by a,b; p2 satisfied by b,c
σp1 OR p2(R) = {a,a,b,b,b,c}
σp1(R) = {a,a,b,b,b}
σp2(R) = {b,b,b,c}
σp1(R) ∪S σp2(R) = {a,a,b,b,b,c}
Laws Involving Selection (Contd..)

Selection is pushed through both arguments for union:
σC(R ∪ S) = σC(R) ∪ σC(S)
Selection is pushed to the first argument, and optionally the second, for difference:
σC(R − S) = σC(R) − S
σC(R − S) = σC(R) − σC(S)
Laws Involving Selection (Contd..)


All other operators require selection to be pushed
to only one of the arguments.
For joins, may not be able to push selection to
both if argument does not have attributes selection
requires.
σC(R × S) = σC(R) × S
σC(R ∩ S) = σC(R) ∩ S
σC(R ⋈ S) = σC(R) ⋈ S
σC(R ⋈D S) = σC(R) ⋈D S
Laws Involving Selection (Contd..)


Example
Consider relations R(a,b) and S(b,c) and the
expression
σ(a=1 OR a=3) AND b<c (R ⋈ S)
⇒ σa=1 OR a=3(σb<c(R ⋈ S))
⇒ σa=1 OR a=3(R ⋈ σb<c(S))
⇒ σa=1 OR a=3(R) ⋈ σb<c(S)
Laws Involving Projection



Projection, like selection can be pushed
down through many other operators
Pushing Projection usually involves
introducing a new projection somewhere
below an existing projection.
A Projection differs from selection in the
aspect that projection reduces the length of
the tuples whereas selection reduces the
number of the tuples.
Laws Involving Projection

Laws for pushing projections through joins (M and N are the attributes of L coming from R and S respectively, plus the join attributes):
πL(R × S) = πL(πM(R) × πN(S))
πL(R ⋈ S) = πL(πM(R) ⋈ πN(S))
πL(R ⋈D S) = πL(πM(R) ⋈D πN(S))
Laws Involving Projection

Laws for pushing projections through set operations:
Projection can be performed entirely before bag union:
πL(R ∪B S) = πL(R) ∪B πL(S)
Projection can be pushed below selection as long as we also keep all attributes needed for the selection (M = L ∪ attr(C)):
πL(σC(R)) = πL(σC(πM(R)))
Laws Involving Join

We have previously seen these important
rules about joins:
1. Joins are commutative and associative.
2. Selection can be distributed into joins.
3. Projection can be distributed into joins.
Laws Involving Duplicate Elimination




The duplicate-elimination operator (δ) can be pushed through many operators, but not all.
For example, if R has two copies of a tuple t and S has one copy of t, then:
δ(R ∪ S) has one copy of t, while δ(R) ∪ δ(S) has two copies of t,
so δ cannot be pushed through bag union.
Laws Involving Duplicate Elimination

Laws for pushing the duplicate-elimination operator (δ):
δ(R × S) = δ(R) × δ(S)
δ(R ⋈ S) = δ(R) ⋈ δ(S)
δ(R ⋈D S) = δ(R) ⋈D δ(S)
δ(σC(R)) = σC(δ(R))
Laws Involving Duplicate Elimination

The duplicate elimination operator (δ) can
also be pushed through bag intersection, but
not across union, difference, or projection in
general.
δ(R ∩ S) = δ(R) ∩ δ(S)
Laws Involving Grouping

The grouping operator (γ) laws depend on the aggregate
operators used.

There is one general rule, however, that grouping subsumes
duplicate elimination:
δ(γL(R)) = γL(R)

The reason is that some aggregate functions are unaffected by
duplicates (MIN and MAX) while other functions are (SUM,
COUNT, and AVG).
The Query Compiler
Section 16.3
DATABASE SYSTEMS – The Complete Book
Presented By:
Deepti Kundu
Under the supervision of:
Dr. T.Y.Lin
Topics to be covered

From Parse to Logical Query Plans

Conversion to Relational Algebra

Removing Sub-queries From Conditions

Improving the Logical Query Plan

Grouping Associative/ Commutative Operators
16.3 From Parse to Logical Query Plans
► Review:
Query → Parser → Preprocessor (Section 16.1) → Logical query plan generator → Query rewriter (Section 16.3) → Preferred logical query plan
Two steps to turn Parse tree into Preferred
Logical Query Plan


Replace the nodes and structures of the parse tree, in
appropriate groups, by an operator or operators of relational
algebra.
Take the relational algebra expression and turn it into an
expression that we expect can be converted to the most efficient
physical query plan.
Reference Relations

StarsIn (movieTitle, movieYear, starName)

MovieStar (name, address, gender, birthdate)
Conversion to Relational Algebra
• The complete algorithm depends on the specific grammar, which determines the forms of the parse trees. If we have a <Query> with a <Condition> that has no subqueries, then we may replace the entire construct – the select-list, from-list, and condition – by a relational-algebra expression.
• The relational-algebra expression consists of the following, from bottom to top:
  • The product of all the relations mentioned in the <FromList>, which is the argument of:
  • A selection σC, where C is the <Condition> expression in the construct being replaced, which in turn is the argument of:
  • A projection πL, where L is the list of attributes in the <SelList>.
A query : Example
SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name AND
      birthdate LIKE '%1960';

Equivalent algebraic expression tree:
π movieTitle(σ starName = name AND birthdate LIKE '%1960'(StarsIn × MovieStar))
Removing Subqueries From Conditions



For parse trees with a <Condition> that has a subquery, we use an intermediate operator: the two-argument selection.
It is intermediate between the syntactic categories of the parse tree and the relational-algebra operators that apply to relations.
Using a two-argument σ
π movieTitle
  σ (two-argument)
    StarsIn
    <Condition>: <Tuple> <Attribute> starName IN
      π name
        σ birthdate LIKE '%1960'
          MovieStar
Two argument selection with condition
involving IN
Now suppose we have two arguments – some relation R, and a <Condition> of the form t IN S, where:
• t is a tuple composed of some attributes of R
• S is an uncorrelated subquery
Steps to be followed:
1. Replace the <Condition> by the tree that is the expression for S (δ is used to remove duplicates).
2. Replace the two-argument selection by a one-argument selection σC.
3. Give σC an argument that is the product of R and S.
Two argument selection with condition
involving IN
Figure (the effect): the two-argument selection over R with condition "t IN S" becomes a one-argument selection σC over the product R × δ(S).
Improving the Logical Query Plan

Algebraic laws to improve logical query plans:




Selections can be pushed down the expression tree as far
as they can go.
Similarly, projections can be pushed down the tree, or
new projections can be added.
Duplicate eliminations can sometimes be removed, or
moved to a more convenient position in the tree.
Certain selections can be combined with a product below
to turn the pair of operations into an equijoin.
Grouping Associative/ Commutative
Operators




An operator that is associative and commutative may be thought of as having any number of operands.
We need to reorder the operands so that a multiway join is executed as a sequence of binary joins; it is often more time-consuming to execute them in the order suggested by the parse tree.
For each portion of subtree that consists of nodes with the
same associative and commutative operator (natural join,
union, and intersection), we group the nodes with these
operators into a single node with many children.
The effect of query rewriting
π movieTitle
  ⋈ starName = name
    StarsIn
    σ birthdate LIKE '%1960'
      MovieStar
Final step in producing the logical query plan

[Figure: a cascade of binary unions over R, S, T, V, W is regrouped
into a single union node with the five children R, S, T, V, W]
An Example to summarize

“Find movies where the average age of the stars was at most 40
when the movie was made.”

SELECT DISTINCT m1.movieTitle, m1.movieYear
FROM StarsIn m1
WHERE m1.movieYear – 40 <= (
    SELECT AVG(birthdate)
    FROM StarsIn m2, MovieStar s
    WHERE m2.starName = s.name AND
          m1.movieTitle = m2.movieTitle AND
          m1.movieYear = m2.movieYear );

[Figures: selections combined with a product to turn the pair of
operations into an equijoin; the condition pushed up the expression
tree; selections combined]
The Query Compiler
(16.4)
DATABASE SYSTEMS – The Complete Book
Presented By:
Maciej Kicinski
Under the supervision of:
Dr. T.Y.Lin
Topics to be covered

From Parse Trees to Logical Query Plans
  Conversion to Relational Algebra
  Removing Subqueries From Conditions
  Improving the Logical Query Plan
  Grouping Associative/Commutative Operators
Estimating the Cost of Operations
  Estimating Sizes of Intermediate Relations
  Estimating the Size of a Projection
  Estimating the Size of a Selection
  Estimating the Size of a Join
16.4 Estimating the Cost of Operations

After getting to the logical query plan, we turn it into a
physical plan.
We consider all the possible physical plans and estimate their
costs – this evaluation is known as cost-based enumeration.
The one with the least estimated cost is the one selected to be
passed to the query-execution engine.
Selection for each physical plan

We select for each physical plan:

An order and grouping for associative-and-commutative
operations like joins, unions, and intersections.
An algorithm for each operator in the logical plan, for
instance, deciding whether a nested-loop join or hash-join
should be used.
Additional operators – scanning, sorting, etc. – that are
needed for the physical plan but that were not present
explicitly in the logical plan.
The way in which the arguments are passed from one operator
to the next.
Estimating Sizes of Intermediate Relations

Rules for estimating the size of an intermediate relation should:
1. Give accurate estimates.
2. Be easy to compute.
3. Be logically consistent; that is, the size estimate for
an intermediate relation should not depend on how
that relation is computed.

The objective of estimation is to select the best physical plan
with the least cost.
Estimating the Size of a Projection

We should treat a classical, duplicate-eliminating
projection as a bag projection.
The size of the result can be computed exactly.
There may be a reduction in size (due to eliminated
components) or an increase in size (due to new
components created as combinations of attributes).
Estimating the Size of a Selection

While performing a selection, we may reduce the number
of tuples, but the size of each tuple remains the same.
For a selection of the form
S = σA=c(R)
where A is an attribute of R and c is a constant,

the recommended estimate is
T(S) = T(R) / V(R, A)
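As a small illustration of this rule, here is a minimal sketch in Python (the catalog statistics and relation name are hypothetical, not from the book):

# Hypothetical catalog statistics: T = tuple count, V[a] = number of
# distinct values of attribute a.
stats = {
    "MovieStar": {"T": 10_000, "V": {"gender": 2, "name": 10_000}},
}

def estimate_equality_selection(relation, attribute):
    """Estimate T(S) for S = σA=c(R) as T(R) / V(R, A)."""
    rel = stats[relation]
    return rel["T"] / rel["V"][attribute]

# σgender='F'(MovieStar) is estimated at 10000 / 2 = 5000 tuples.
print(estimate_equality_selection("MovieStar", "gender"))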
Estimating Sizes of Other Operations

Union

Intersection

Difference

Duplicate Elimination

Grouping and Aggregation
Choosing an Order for Joins
Chapter 16.6 by:
Chiu Luk
ID: 210
Introduction
• This section focuses on a critical problem in cost-based
optimization:
– Selecting an order for the natural join of three or more
relations
• Compared to other binary operations, joins
take more time and therefore need effective
optimization techniques
Significance of Left and Right Join
Arguments
• The argument relations in joins determine the
cost of the join
• The left argument of the join is
– Called the build relation
– Assumed to be smaller
– Stored in main-memory
Significance of Left and Right Join
Arguments
• The right argument of the join is
– Called the probe relation
– Read a block at a time
– Its tuples are matched with those of build relation
• The join algorithms which distinguish between
the arguments are:
– One-pass join
– Nested-loop join
– Index join
Join Trees
• Order of arguments is important for joining
two relations
• Left argument, since stored in main-memory,
should be smaller
• With two relations only two choices of join
tree
• With more than two relations, there are n!
ways to order the arguments and therefore n!
join trees, where n is the no. of relations
Join Trees
• The total # of tree shapes T(n) for n relations is given
by a recurrence:
• T(1) = 1
• T(2) = 1
• T(3) = 2
• T(4) = 5 … etc.
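The recurrence behind these values is the standard count of binary tree shapes: the n leaves are split between the two subtrees of the root, giving

T(1) = 1, and for n > 1: T(n) = Σi=1..n−1 T(i)·T(n−i),

which yields 1, 1, 2, 5, 14, … (the Catalan numbers).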
Left-Deep Join Trees
• Consider 4 relations. Different ways to join
them are as follows
• In fig (a) all the right children are leaves. This is
a left-deep tree
• In fig (c) all the left children are leaves. This is a
right-deep tree
• Fig (b) is a bushy tree
• Considering left-deep trees is advantageous
for deciding join orders
Join order
• Join order selection for relations A1, A2, A3, …, An
– Left-deep join trees: join the relations in a sequence,
adding one relation Ai at each step
– Dynamic programming
• Best plan computed for each subset of the relations
• Best plan(A1, …, An) = min-cost plan of:
Best plan(A2, …, An) ⋈ A1
Best plan(A1, A3, …, An) ⋈ A2
…
Best plan(A1, …, An-1) ⋈ An
Dynamic Programming to Select a Join
Order and Grouping
• To pick an order for the join of many relations there are three
choices:
– Consider all of them.
– Consider a subset.
– Use a heuristic to pick one.
• Dynamic programming is used either to consider all or a subset
– Construct a table of costs based on relation size
– Remember only the minimum entry, which will be required to
proceed
Dynamic Programming to Select a Join
Order and Grouping
Disadvantage:
– Dynamic programming does not involve the actual costs of the
joins in the calculations.
– It can be improved by considering disk I/O's when evaluating cost.
– When computing the cost of R1 ⋈ R2, since we sum the costs of R1 and
R2, we must also compute estimates for their sizes.
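To make the table-building idea concrete, here is a minimal Python sketch of dynamic programming over subsets of relations. It uses the simple heuristic that a plan's cost is the sum of the sizes of its intermediate relations; the base-relation sizes and the size-estimation function are illustrative stand-ins, not values from the slides:

from itertools import combinations

# Hypothetical base-relation sizes and a crude join-size estimator.
sizes = {"R": 1000, "S": 2000, "T": 500, "U": 100}

def est_join_size(left_size, right_size):
    # Stand-in for the textbook estimate T(R)T(S)/max(V(R,Y), V(S,Y)).
    return left_size * right_size // 100

def best_plan(relations):
    """best[subset] = (result size, cost = sum of intermediate sizes, plan)."""
    best = {frozenset([r]): (sizes[r], 0, r) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            candidates = []
            for m in range(1, k):  # every split into two non-empty halves
                for left in map(frozenset, combinations(subset, m)):
                    right = subset - left
                    lsize, lcost, lplan = best[left]
                    rsize, rcost, rplan = best[right]
                    # Results of non-trivial subplans become intermediates.
                    cost = (lcost + rcost
                            + (lsize if len(left) > 1 else 0)
                            + (rsize if len(right) > 1 else 0))
                    size = est_join_size(lsize, rsize)
                    candidates.append((cost, size, f"({lplan} JOIN {rplan})"))
            cost, size, plan = min(candidates)
            best[subset] = (size, cost, plan)
    return best[frozenset(relations)]

print(best_plan(["R", "S", "T", "U"]))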
A Greedy Algorithm for Selecting a Join
Order
• It is expensive to use an exhaustive method
like dynamic programming
• A better approach is to use a join-order heuristic
for query optimization
• A greedy algorithm is an example of that
– Make one decision at a time about the order of joins,
and never backtrack on a decision once made
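A minimal sketch of the greedy heuristic, reusing the hypothetical sizes and est_join_size from the sketch above: start with the pair of relations whose join is estimated smallest, then repeatedly add the relation yielding the smallest next intermediate result.

def greedy_order(relations):
    """Pick a left-deep join order greedily by smallest intermediate size."""
    remaining = set(relations)
    # Start with the pair whose estimated join result is smallest.
    a, b = min(((x, y) for x in remaining for y in remaining if x < y),
               key=lambda p: est_join_size(sizes[p[0]], sizes[p[1]]))
    order, current = [a, b], est_join_size(sizes[a], sizes[b])
    remaining -= {a, b}
    while remaining:
        # Never backtrack: commit to the locally best choice each time.
        nxt = min(remaining, key=lambda r: est_join_size(current, sizes[r]))
        current = est_join_size(current, sizes[nxt])
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(greedy_order(["R", "S", "T", "U"]))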
Query Compiler: 16.7 Completing
the Physical Query-Plan
CS257 Spring 2009
Professor Tsau Lin
Student: Suntorn Sae-Eung
ID: 212
Outline
16.7 Completing the Physical-Query-Plan
I. Choosing a Selection Method
II. Choosing a Join Method
III. Pipelining Versus Materialization
IV. Pipelining Unary Operations
V. Pipelining Binary Operations
Before Completing the Physical Query Plan

A query previously has been:

Parsed and Preprocessed (16.1)
Converted to Logical Query Plans (16.3)
Estimated the Costs of Operations (16.4)
Determined costs by Cost-Based Plan Selection (16.5)
Weighed costs of join operations by choosing an Order
for Joins (16.6)
16.7 Completing the Physical Query Plan

3 topics related to turning the logical plan into a complete
physical plan:

Choosing physical implementations such as
Selection and Join methods
Decisions regarding intermediate results
(Materialized or Pipelined)
Notation for physical-query-plan operators
I. Choosing a Selection Method (A)

Algorithms for each selection operator:
1. Can we use a created index on an attribute?

If yes, index-scan. Otherwise table-scan.
2. After retrieving all condition-satisfying tuples in (1),
filter them with the remaining selection conditions.
Choosing a Selection Method (A) (cont.)

Recall: Cost of query = # disk I/O’s

How costs for various plans are estimated for a σC(R) operation:
1. Cost of the table-scan algorithm
   B(R)             if R is clustered
   T(R)             if R is not clustered
2. Cost of a plan picking an equality term (e.g. a = 10) w/ index-scan
   B(R) / V(R, a)   clustering index
   T(R) / V(R, a)   nonclustering index
3. Cost of a plan picking an inequality term (e.g. b < 20) w/ index-scan
   B(R) / 3         clustering index
   T(R) / 3         nonclustering index
Example
Selection: σx=1 AND y=2 AND z<5 (R)
- where the parameters of R(x, y, z) are:
  T(R)=5000, B(R)=200, V(R,x)=100, and V(R,y)=500
- Relation R is clustered
- x, y have nonclustering indexes; only the index on z is
  clustering.
Example (cont.)
Selection options:

1. Table-scan  filter x, y, z. Cost is B(R) = 200 since R is
clustered.
2. Use index on x = 1  filter on y, z. Cost is 50 since
T(R) / V(R, x) is (5000/100) = 50 tuples; the index is not clustering.
3. Use index on y = 2  filter on x, z. Cost is 10 since
T(R) / V(R, y) is (5000/500) = 10 tuples using a nonclustering index.
4. Index-scan on the clustering index w/ z < 5  filter x, y. Cost is
about B(R)/3 = 67.
Example (cont.)
Costs:

option 1 = 200
option 2 = 50
option 3 = 10
option 4 = 67

The lowest cost is option 3.
Therefore, the preferred physical plan
1. retrieves all tuples with y = 2, then
2. filters for the remaining two conditions (x, z).
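A minimal Python transcription of this comparison, using the cost formulas above and the example's statistics (integer division gives 66 for option 4, which the slides round to about 67):

# Statistics from the example: R(x, y, z), clustered.
T_R, B_R = 5000, 200
V = {"x": 100, "y": 500}

plans = {
    "1: table-scan": B_R,                              # R is clustered
    "2: index-scan x = 1 (nonclustering)": T_R // V["x"],
    "3: index-scan y = 2 (nonclustering)": T_R // V["y"],
    "4: index-scan z < 5 (clustering)": B_R // 3,
}

best = min(plans, key=plans.get)
print(plans)
print("preferred plan:", best)      # option 3, cost 10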
II. Choosing a Join Method

Determine the costs associated with each join algorithm:
1. One-pass join and nested-loop join assume that enough buffers can be
devoted to the join, or that the buffer manager can come close, so thrashing
is not a major cost.
2. Sort-join is preferred when attributes are pre-sorted, or when two or more
joins on the same attribute occur, such as
(R(a, b) ⋈ S(a, c)) ⋈ T(a, d)
- where sorting R and S on a produces the result of R ⋈ S sorted on a,
to be used directly in the next join.

Choosing a Join Method (cont.)
3. Index-join for a join with a high chance of using an index
created on the join attribute, such as R(a, b) ⋈ S(b, c).
4. Hash join is the best choice for unsorted or non-indexed
relations which need a multipass join.
III. Pipelining Versus Materialization

Materialization (naïve way)

store the (intermediate) result of each operation on disk

Pipelining (more efficient way)

interleave the execution of several operations; the tuples produced by one
operation are passed directly to the operations that use them

store the (intermediate) result of each operation in buffers, which are
allocated in main memory
IV. Pipelining Unary Operations

Unary = a-tuple-at-a-time or full relation
Selection and projection are the best candidates for
pipelining.

[Figure: R flows through an input buffer into a unary operation, whose
output buffer feeds a second unary operation’s input buffer, with M-1
buffers available]
Pipelining Unary Operations (cont.)

Pipelined unary operations are implemented by iterators.
V. Pipelining Binary Operations

Binary operations: ∪, ∩, −, ⋈, ×
The results of binary operations can also be
pipelined.
Use one buffer to pass the result to its consumer, one
block at a time.
The extended example below shows trade-offs and
opportunities.
Example

Consider a physical query plan for the expression
(R(w, x) ⋈ S(x, y)) ⋈ U(y, z)

Assumptions:

R occupies 5,000 blocks; S and U each occupy 10,000 blocks.
The intermediate result R ⋈ S occupies k blocks for some k.
Both joins will be implemented as hash-joins, either one-pass
or two-pass depending on k.
There are 101 buffers available.
Example (cont.)

First consider the join R ⋈ S; neither relation
fits in the buffers.

We need a two-pass
hash-join to partition
R into 100 buckets
(the maximum possible); each bucket has 50 blocks.

The 2nd pass of the hash-join uses 51 buffers, leaving the remaining 50
buffers for joining the result of R ⋈ S with U.
Example (cont.)

Case 1: suppose k ≤ 49; the result of R ⋈ S
occupies at most 49 blocks.

Steps:

Pipeline R ⋈ S into 49 buffers.
Organize them for lookup as a hash table.
Use the one buffer left to read each block of U in turn.
Execute the second join as a one-pass join.
Example (cont.)

The total number of I/O’s is 55,000:

45,000 for the two-pass hash join
of R and S
10,000 to read U for the one-pass
hash join of (R ⋈ S) ⋈ U.
Example (cont.)

Case 2: suppose k > 49 but k < 5,000. We can still pipeline,
but we need another strategy: the intermediate result joins
with U in a 50-bucket, two-pass hash-join. Steps are:

Before starting on R ⋈ S, hash U into 50 buckets of 200 blocks
each.
Perform the two-pass hash join of R and S using 51 buffers, as in case 1,
placing the result in the 50 remaining buffers to form 50 buckets for
the join of R ⋈ S with U.
Finally, join R ⋈ S with U bucket by bucket.
Example (cont.)

The number of disk I/O’s is:

20,000 to read U and write its tuples into buckets
45,000 for the two-pass hash-join R ⋈ S
k to write out the buckets of R ⋈ S
k + 10,000 to read the buckets of R ⋈ S and U in the final
join

The total cost is 75,000 + 2k.
Example (cont.)

Compare the increase in I/O’s between case 1 and case 2:

k ≤ 49 (case 1)
Disk I/O’s: 55,000

50 ≤ k ≤ 5,000 (case 2)
k=50: I/O’s = 75,000+(2*50) = 75,100
k=51: I/O’s = 75,000+(2*51) = 75,102
k=52: I/O’s = 75,000+(2*52) = 75,104

Notice: the I/O count jumps discontinuously as k increases from 49 to 50.
Example (cont.)

Case 3: k > 5,000. We cannot perform the two-pass
join in the 50 buffers available if the result of R ⋈ S is
pipelined. Steps are:

1. Compute R ⋈ S using a two-pass join and store the result on
disk.
2. Join the result of (1) with U, using a two-pass join.
Example (cont.)

The number of disk I/O’s is:

45,000 for the two-pass hash-join of R and S
k to store R ⋈ S on disk
30,000 + 3k for the two-pass join of R ⋈ S with U

The total cost is 75,000 + 4k.
Example (cont.)

In summary, the costs of the physical plans as a function of
the size of R ⋈ S are:

k ≤ 49           → 55,000
50 ≤ k ≤ 5,000   → 75,000 + 2k
k > 5,000        → 75,000 + 4k
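A minimal sketch that reproduces the three cases as a single cost function of k (a direct transcription of the arithmetic above):

def join_plan_cost(k):
    """Disk I/O cost of (R ⋈ S) ⋈ U as a function of k = B(R ⋈ S)."""
    if k <= 49:
        return 55_000            # case 1: pipeline into a one-pass join
    if k <= 5_000:
        return 75_000 + 2 * k    # case 2: pipeline into a 50-bucket two-pass join
    return 75_000 + 4 * k        # case 3: materialize, then two-pass join

for k in (49, 50, 5_000, 5_001):
    print(k, join_plan_cost(k))  # shows the jump at k = 50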
Completing the Physical Query Plan and Chapter 16
Summary (16.7-16.8)
CS257 Spring 2009
Professor Tsau Lin
Student: Suntorn Sae-Eung
Donavon Norwood
Outline
16.7 Completing the Physical-Query-Plan
VI. Notation for Physical Query Plan
VII. Ordering the Physical Operations
16.8 Summary of Chapter 16
Before Completing the Physical Query Plan

A query previously has been:

Parsed and Preprocessed (16.1)
Converted to Logical Query Plans (16.3)
Estimated the Costs of Operations (16.4)
Determined costs by Cost-Based Plan Selection (16.5)
Weighed costs of join operations by choosing an Order
for Joins (16.6)
VI. Notation for Physical Query Plans

Several types of operators:
1. Operators for leaves
2. (Physical) operators for Selection
3. (Physical) Sort operators
4. Other relational-algebra operations

In practice, each DBMS uses its own internal
notation for physical query plans.
Notation for Physical Query Plans
(cont.)

1. Operators for leaves

A leaf operand in the LQP tree is replaced
by a scan operator. The options are:

TableScan(R): read all blocks
SortScan(R, L): read in order according to L
IndexScan(R, C): scan the index on attribute A using condition C
of the form Aθc.
IndexScan(R, A): scan the index on attribute R.A. This
behaves like TableScan but is more efficient if R is not
clustered.
Notation for Physical Query Plans
(cont.)

2. (Physical) operators for Selection

The logical operator σC(R) is often combined with
access methods:

If σC(R) is replaced by Filter(C), and there is no index on
R or on an attribute in condition C,
use TableScan or SortScan(R, L) to access R.
If condition C can be written as Aθc AND D for some condition D, and
there is an index on R.A, then we may
use the operator IndexScan(R, Aθc) to access R, and
use Filter(D) in place of the selection σC(R).
Notation for Physical Query Plans
(cont.)

3. (Physical) Sort operators

Sorting can occur at any point in the physical query plan;
at a leaf we use the notation SortScan(R, L).
It is common to use an explicit operator Sort(L) to
sort a relation that is not stored.
It can be applied at the top of the physical-query-plan tree if
the result needs to be sorted due to an ORDER BY clause in
the original query.
Notation for Physical Query Plans
(cont.)

4. Other relational-algebra operations

The operator is described by text and signs that elaborate:

The operation performed: e.g., join or grouping.
Necessary parameters: the condition of a theta-join or the list of
elements in a grouping.
A general strategy for the algorithm: sort-based, hash-based,
or index-based.
A decision about the number of passes to be used: one-pass,
two-pass, or multipass.
The anticipated number of buffers the operation will
require.
Notation for Physical Query Plans
(cont.)

Example of a physical-query-plan:

[Figure: the physical query plan of example 16.36 for the case k > 5,000 —
TableScan leaves, two-pass hash joins, materialization (double line),
and a Store operator]
Notation for Physical Query Plans
(cont.)

Another example:

[Figure: the physical query plan of example 16.36 for the case k ≤ 49 —
TableScan leaves, (2) two-pass hash joins, pipelining, different buffer
needs, and a Store operator]
Notation for Physical Query Plans
(cont.)

[Figure: the physical query plan of example 16.35 —
use the index on condition y = 2 first, then
filter with the rest of the conditions.]
VII. Ordering of Physical
Operations

The PQP is represented as a tree structure with an
implied order of operations.

Still, the order of evaluation of interior nodes may
not always be clear:

Iterators are used in a pipelined manner.
The overlapped execution of various nodes can make
a strict “ordering” meaningless.
Ordering of Physical Operations (cont.)

3 rules summarize the ordering of events in a
PQP tree:

1. Break the tree into subtrees at each edge that
represents materialization.
Execute one subtree at a time.
2. Order the execution of the subtrees
bottom-up and
left-to-right.
3. All nodes of each subtree are executed
simultaneously.
Summary of Chapter 16
In this part of the presentation I will talk about the
main topics of Chapter 16.
COMPILATION OF
QUERIES
 Compilation means turning a query into a
physical query plan, which can be
implemented by the query engine.
 Steps of query compilation:
 Parsing
 Semantic checking
 Selection of the preferred logical query plan
 Generating the best physical plan
THE PARSER
 The first step of SQL query processing.
 Generates a parse tree.
 Nodes in the parse tree correspond to
SQL constructs.
 Similar to the compiler of a programming
language.
VIEW EXPANSION
 A very critical part of query compilation.
 Expands the view references in the query
tree to the actual view.
 Provides opportunities for the query
optimization.
SEMANTIC CHECKING
 Checks the semantics of a SQL query.
 Examines a parse tree.
 Checks :
 Attributes
 Relation names
 Types
 Resolves attribute references.
CONVERSION TO A
LOGICAL QUERY PLAN
 Converts a semantically checked parse tree to
an algebraic expression.
 Conversion is straightforward, but
subqueries need to be optimized.
 The two-argument selection approach can be
used.
ALGEBRAIC
TRANSFORMATION
 Many different ways to transform a logical query plan to an
actual plan using algebraic transformations.
 The laws used for this transformation :
 Commutative and associative laws
 Laws involving selection
 Pushing selection
 Laws involving projection
 Laws about joins and products
 Laws involving duplicate eliminations
 Laws involving grouping and aggregation
ESTIMATING SIZES OF
RELATIONS
 True running time is taken into
consideration when selecting the best
logical plan.
 Two factors affect the estimates of
relation sizes the most:
 Size of relations (no. of tuples)
 No. of distinct values for each attribute of
each relation
 Histograms are used by some systems.
COST BASED OPTIMIZING
 The best physical query plan represents the
least costly plan.
 Factors that decide the cost of a query
plan:
 Order and grouping of operations like joins,
unions and intersections.
 The nested-loop or hash-join algorithms used.
 Scanning and sorting operations.
 Storing intermediate results.
PLAN ENUMERATION
STRATEGIES
 Common approaches for searching the
space for the best physical plan:
 Dynamic programming: tabularizing the
best plan for each subexpression
 Selinger-style programming: also keeping the
sort order of results as part of the table
 Greedy approaches: making a series of
locally optimal decisions
 Branch-and-bound: starting from a plan found
heuristically and considering only plans that beat it
LEFT-DEEP JOIN TREES
 Left-deep join trees are the binary
trees with a single spine down the left
edge and with leaves as right children.
 This strategy reduces the number of
plans to be considered for the best
physical plan.
 Restrict the search to left-deep join
trees when picking a grouping and order
for the join of several relations.
PHYSICAL PLANS FOR
SELECTION
 Breaking a selection into an index-scan of the
relation, followed by a filter operation.
 The filter then examines the tuples
retrieved by the index-scan.
 It passes only those tuples that meet the
remaining portions of the selection condition.
PIPELINING VERSUS
MATERIALIZING
 The flow of data between operators can be
controlled to implement “pipelining”.
 Intermediate results should be removed from main
memory to save space for other operators.
 This technique can be implemented using
“materialization”.
 Both pipelining and materialization should be
considered by the physical query plan generator.
 With pipelining, an operator consumes the result of another
operator as it is passed through main memory.
Query Execution
Chapter 15
Section 15.1
Presented by
Khadke, Suvarna
CS 257
(Section II) Id 213
Agenda
 Query Processor and major parts of Query
processor
 Physical-Query-Plan Operators
 Scanning Tables
 Basic approaches to locate the tuples of a
relation R
 Sorting While Scanning Tables
 Computation Model for Physical Operator
 I/O Cost for Scan Operators
 Iterators
What is a Query Processor
 Group of components of a DBMS that converts
user queries and data-modification
commands into a sequence of database
operations
 It also executes those operations
 Must supply detail regarding how the query is
to be executed
Major parts of Query processor
Query Execution:
The algorithms
that manipulate the
data of the
database.
Focus on the
operations of
extended relational
algebra.
Outline of Query Compilation
Query compilation:
 Parsing: a parse tree for the
query is constructed
 Query Rewrite: the parse tree
is converted to an initial query
plan and transformed into a
logical query plan expected to
take less time
 Physical Plan Generation: the
logical query plan is converted into a
physical query plan by selecting
algorithms and the order of
execution of these operators.
Selection of best Query-Plan
To select the best query plan :
 Which of the algebraically equivalent forms
of a query leads to the most efficient
algorithm for answering the query.
 For each operation of the selected forms,
what algorithm should we use to implement
that operation?
Physical-Query-Plan Operators
 Physical operators are implementations of the
operators of relational algebra.
 They can also be operators with no relational-algebra
counterpart, like “scan”, which scans tables, that
is, brings each tuple of some relation into main
memory
Scanning Tables
 One of the most basic things we can do in a physical
query plan is to read the entire contents of a
relation R, for example, to take the union or join
of R with another relation.
 A variation of this operator involves a simple
predicate: read only those tuples of the relation
R that satisfy the predicate.
Scanning Tables
Basic approaches to locate the tuples of a relation R:
 Table-scan
 Relation R is stored in secondary memory with
its tuples arranged in blocks
 It is possible to get the blocks one by one
 Index-scan
 If there is an index on any attribute of relation
R, we can use this index to get all the tuples of
relation R
Sorting While Scanning Tables
 A number of reasons to sort a relation:
 The query could include an ORDER BY clause,
requiring that a relation be sorted.
 Algorithms to implement relational algebra
operations require one or both arguments to be
sorted relations.
 The physical-query-plan operator sort-scan takes a
relation R and the attributes on which the sort is to be
made, and produces R in that sorted order.
Computation Model for Physical
Operators
 Physical-plan operators should be selected
wisely, which is essential for a good query
processor.
 The “cost” of each operator is estimated by the
number of disk I/O’s for the operation.
 The cost of writing the final answer depends only
on the size of the answer, not on how it was computed,
so it is not charged to the query plan.
Parameters for Measuring Costs
 Parameters that affect the performance of a query:
 Buffer space available in main memory
at the time of execution of the query
 Size of the input and the size of the output
generated
 The size of a block on disk and its
size in main memory also affect the
performance
Parameters for Measuring Costs
 B: the number of blocks needed to hold all
tuples of relation R.
 Also denoted B(R)
 T: the number of tuples in relation R.
 Also denoted T(R)
 V: the number of distinct values that appear in a
column of relation R.
 V(R, a) is the number of distinct values of
column a in relation R
I/O Cost for Scan Operators
 If relation R is clustered, then the number of
disk I/O’s for the table-scan operator is ~B
disk I/O’s
 If relation R is not clustered, then the number
of required disk I/O’s is generally much higher
 An index on relation R occupies many fewer
than B(R) blocks
 That means a scan of the entire relation R,
which takes at least B disk I/O’s, will require
more I/O’s than reading the entire index
Iterators for Implementation of
Physical Operators
 Many physical operators can be implemented
as an iterator.
 The three methods forming the iterator for an
operation are:
 1. Open():
 This method starts the process of getting
tuples
 It initializes any data structures needed to
perform the operation
Iterators for Implementation of
Physical Operators
 2. GetNext():
 Returns the next tuple in the result
 If there are no more tuples to return, GetNext
returns a special value NotFound
 3. Close():
 Ends the iteration after all tuples have been returned
 It calls Close on any arguments of the
operator
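As a minimal sketch of this interface, here is a table-scan iterator in Python (the class and its block-structured input are illustrative, not from the book):

NOT_FOUND = object()  # stands in for the special NotFound value

class TableScan:
    """Iterator over relation R, following the Open/GetNext/Close protocol."""

    def __init__(self, blocks):
        self.blocks = blocks          # R's tuples, arranged in blocks
        self.tuples = None

    def open(self):
        # Initialize the data structures needed to perform the operation.
        self.tuples = (t for block in self.blocks for t in block)

    def get_next(self):
        # Return the next tuple, or NOT_FOUND when the result is exhausted.
        return next(self.tuples, NOT_FOUND)

    def close(self):
        # Release resources; a real operator would also close its arguments.
        self.tuples = None

scan = TableScan(blocks=[[(1, "a"), (2, "b")], [(3, "c")]])
scan.open()
while (t := scan.get_next()) is not NOT_FOUND:
    print(t)
scan.close()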
Query Execution
One-Pass Algorithms for Database Operations (15.2)
Presented by
Ronak Shah
(214)
April 22, 2009
Introduction
 The choice of an algorithm for each operator is an essential
part of the process of transforming a logical query plan into
a physical query plan.
 Main classes of algorithms:
 Sorting-based methods
 Hash-based methods
 Index-based methods
 Division based on degree of difficulty and cost:
 1-pass algorithms
 2-pass algorithms
 3-or-more-pass algorithms
One-Pass Algorithm Methods
 Tuple-at-a-time, unary operations (selection & projection):
do not require the entire relation at once; we can read a block at a
time, use one memory buffer, and produce our output.
 Full-relation, unary operations: these one-argument
operations require seeing all or most of the tuples in
memory at once, so one-pass algorithms are limited to
relations that are approximately of size M or less.
 Full-relation, binary operations: set & bag versions of
union, intersection, difference, joins and products.
One-Pass Algorithms for Tuple-at-a-Time
Operations
 Tuple-at-a-time operations are selection and projection:
 read the blocks of R one at a time into an input buffer
 perform the operation on each tuple
 move the selected tuples or the projected tuples to the output
buffer
 The disk I/O requirement for this process depends only on
how the argument relation R is provided:
 if R is initially on disk, then the cost is whatever it takes to
perform a table-scan or index-scan of R.
[Figure: a selection or projection being performed
on a relation R]
One-Pass Algorithms for Unary, Full-Relation Operations
 Duplicate Elimination
 To eliminate duplicates, we can read each block of R one at a
time, but for each tuple we need to make a decision as to
whether:
 1. it is the first time we have seen this tuple, in which case we
copy it to the output, or
 2. we have seen the tuple before, in which case we must not
output this tuple.
 One memory buffer holds one block of R's tuples, and the
remaining M - 1 buffers can be used to hold a single copy of
every tuple seen so far.

[Figure: managing memory for a one-pass
duplicate elimination]
Duplicate Elimination
 When a new tuple from R is considered, we compare it with
all tuples seen so far:
 if it has not been seen before, we copy it to the output
and add it to the in-memory list of tuples we have seen.
 if there are n tuples in main memory, each new tuple takes
processor time proportional to n, so the complete operation
takes processor time proportional to n².
 We need a main-memory structure that allows each of the
operations:
 Add a new tuple, and
 Tell whether a given tuple is already there

Duplicate Elimination (…contd.)
 The structures that can be used for such main-memory
structures are:
 Hash table
 Balanced binary search tree
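A minimal sketch of one-pass duplicate elimination in Python, using a set as the main-memory search structure (block handling is simplified; the structure must fit in M - 1 buffers):

def one_pass_distinct(blocks):
    """One-pass duplicate elimination: stream blocks of R, keeping one copy
    of each tuple seen so far in a main-memory hash structure."""
    seen = set()                      # must fit in M - 1 buffers
    for block in blocks:              # one input buffer holds the current block
        for t in block:
            if t not in seen:         # first time we see this tuple
                seen.add(t)
                yield t               # copy it to the output

R = [[(1, "a"), (2, "b")], [(1, "a"), (3, "c")]]
print(list(one_pass_distinct(R)))     # [(1, 'a'), (2, 'b'), (3, 'c')]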
One-Pass Algorithms for Unary, Full-Relation Operations
 Grouping
 The grouping operation gives us zero or more grouping
attributes and presumably one or more aggregated attributes.
 If we create in main memory one entry for each group – that is,
for each value of the grouping attributes – then we can scan the
tuples of R, one block at a time.
 The entry for a group consists of values for the grouping
attributes and an accumulated value or values for each
aggregation.

Grouping
The accumulated value is, for example:
 for a MIN(a) or MAX(a) aggregate, the least or greatest
value, respectively, of attribute a seen so far in the group.
One-Pass Algorithms for Binary Operations
 Binary operations include:





Union
Intersection
Difference
Product
Join
Set Union
 We read S into M - 1 buffers of main memory and build a
search structure where the search key is the entire tuple.
 All these tuples are also copied to the output.
 Read each block of R into the Mth buffer, one at a time.
 For each tuple t of R, see if t is in S; if not, we copy t to
the output. If t is also in S, we skip t.
Set Intersection
 Read S into M - 1 buffers and build a search structure with
full tuples as the search key.
 Read each block of R, and for each tuple t of R, see if t is
also in S. If so, copy t to the output, and if not, ignore t.
Set Difference
 Read S into M - 1 buffers and build a search structure with
full tuples as the search key.
 To compute R -S S, read each block of R and examine each
tuple t on that block. If t is in S, then ignore t; if it is not in
S, then copy t to the output.
 To compute S -S R, read the blocks of R and examine each
tuple t in turn. If t is in S, then delete t from the copy of S in
main memory; if t is not in S, do nothing.
 After considering each tuple of R, copy to the output those
tuples of S that remain.
Bag Intersection
 Read S into M - 1 buffers.
 Multiple copies of a tuple t are not stored individually;
rather, store 1 copy of t and associate with it a count equal to
the number of times t occurs.
 Next, read each block of R, and for each tuple t of R see
whether t occurs in S. If not, ignore t; it cannot appear in the
intersection. If t appears in S, and the count associated with t is
positive, then output t and decrement the count by 1. If t appears
in S, but the count has reached 0, then do not output t; we have
already produced as many copies of t in the output as there
were copies in S.
Bag Difference
 To compute S -B R, read the tuples of S into main memory and
count the number of occurrences of each distinct tuple.
 Then read R; check each tuple t to see whether t occurs in S,
and if so, decrement its associated count. At the end, copy
to the output each tuple in main memory whose count is
positive, as many times as that count.
 To compute R -B S, read the tuples of S into main memory and
count the number of occurrences of distinct tuples.
Bag Difference (…contd.)
 Think of a tuple t with a count of c as c reasons not to copy
t to the output as we read the tuples of R.
 Read a tuple t of R; check whether t occurs in S. If not, copy t
to the output. If t does occur in S, look at the current
count c associated with t. If c = 0, then copy t to the output. If
c > 0, do not copy t to the output, but decrement c by 1.
Product
 Read S into M - 1 buffers of main memory.
 Then read each block of R, and for each tuple t of R,
concatenate t with each tuple of S in main memory.
 Output each concatenated tuple as it is formed.
 This algorithm may take a considerable amount of
processor time per tuple of R, because each such tuple must
be matched with M - 1 blocks full of tuples. However, the
output size is also large, and the time per output tuple is small.
Natural Join
 Convention: R(X, Y) is being joined with S(Y, Z), where Y
represents all the attributes that R and S have in common, X
is all attributes of R that are not in the schema of S, and Z is
all attributes of S that are not in the schema of R. Assume
that S is the smaller relation.
 To compute the natural join, do the following:
 1. Read all tuples of S and form them into a main-memory
search structure keyed on the attributes of Y.
Hash tables and balanced trees are good examples of such structures.
Use M - 1 blocks of memory for this purpose.
Natural Join (…contd.)
 2. Read each block of R into the 1 remaining main-memory
buffer.
 For each tuple t of R, find the tuples of S that agree with
t on all attributes of Y, using the search structure.
 For each matching tuple of S, form a tuple by joining
it with t, and move the resulting tuple to the output.
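A minimal Python sketch of this one-pass join, using a dict keyed on the common attributes Y (the tuple layout and relation contents are illustrative):

from collections import defaultdict

def one_pass_join(R, S):
    """One-pass natural join of R(X, Y) with S(Y, Z).
    Tuples are (x, y) for R and (y, z) for S; S must fit in M - 1 buffers."""
    # Step 1: read all of S into a main-memory search structure keyed on Y.
    table = defaultdict(list)
    for y, z in S:
        table[y].append(z)
    # Step 2: stream R one tuple at a time and probe the structure.
    for x, y in R:
        for z in table.get(y, []):
            yield (x, y, z)

R = [("a", 1), ("b", 2)]
S = [(1, "p"), (1, "q"), (3, "r")]
print(list(one_pass_join(R, S)))   # [('a', 1, 'p'), ('a', 1, 'q')]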
15.3
Nested-Loop Joins
By:
Saloni Tamotia (215)

Introduction to Nested-Loop Joins

Can be used for relations of any size; it is
not necessary that a relation fits in main memory.
Uses a “one-and-a-half” pass method in which, for
each variation:
One argument is read just once.
The other argument is read repeatedly.
Two kinds:

Tuple-Based Nested-Loop Join
Block-Based Nested-Loop Join
ADVANTAGES OF NESTED-LOOP JOIN
 Fits in the iterator framework.
 Allows us to avoid storing the
intermediate relation on disk.

Tuple-Based Nested-Loop
Join
 The simplest variation of the
nested-loop join
 The loops range over individual
tuples
Tuple-Based Nested-Loop Join

Algorithm to compute the join R(X, Y) ⋈ S(Y, Z):

FOR each tuple s in S DO
    FOR each tuple r in R DO
        IF r and s join to make tuple t THEN
            output t;

R and S are two relations with r and s as tuples.
Carelessness in the buffering of blocks causes the use
of T(R)T(S) disk I/O’s.
IMPROVEMENT & MODIFICATION
To decrease the cost:

Method 1: use the algorithm for index-based joins
 We find the tuples of R that match a given tuple of S
 We need not read the entire relation R

Method 2: use the algorithm for block-based joins
 Tuples of R & S are divided into blocks
 Uses enough memory to store blocks in order to
reduce the number of disk I/O’s
Block-Based Nested-Loop Join Algorithm

 Access to the arguments is organized by block.
 While reading tuples of the inner relation we use
fewer disk I/O’s.
 Uses enough space in main memory to store
tuples of the relation of the outer loop.
 Allows us to join each tuple of the inner relation
with as many tuples as possible.

ALGORITHM:
FOR each chunk of M-1 blocks of S DO
    read these blocks into main-memory buffers;
    FOR each block b of R DO
        read b into main memory;
        FOR each tuple t of b DO
            find the tuples of S in main memory that join with t;
            output the join of t with each of these tuples;
Block-Based Nested-Loop Join Algorithm

Assumptions:
 B(S) ≤ B(R)
 B(S) > M

This means that neither relation fits in
main memory.
Analysis of Nested-Loop Join

Number of disk I/O’s:
[B(S)/(M-1)] * (M-1 + B(R))
or
B(S) + [B(S)B(R)/(M-1)]
or approximately B(S)*B(R)/M

Assuming all of M, B(S), B(R) are large, but M
is the smallest of these, the cost is
~B(S)*B(R)/M,
i.e. proportional to the product of
the sizes of the two relations, divided by the
amount of available main memory.
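A minimal Python sketch of the block-based algorithm (blocks are modeled as lists; M, the join predicate, and the data are illustrative):

def block_nested_loop_join(S_blocks, R_blocks, M, joins):
    """Block-based nested-loop join: S is the outer relation, read once per
    chunk of M-1 blocks; R is the inner relation, read once per chunk."""
    chunk = M - 1
    for i in range(0, len(S_blocks), chunk):
        # Read M-1 blocks of S into main-memory buffers.
        s_tuples = [s for block in S_blocks[i:i + chunk] for s in block]
        for b in R_blocks:            # read each block of R into memory
            for t in b:
                for s in s_tuples:    # find tuples of S that join with t
                    if joins(t, s):
                        yield t + s

S = [[(1, "p")], [(2, "q")]]
R = [[("a", 1), ("b", 2)], [("c", 1)]]
# Join on R.Y = S.Y: Y is the 2nd attribute of R tuples, 1st of S tuples.
print(list(block_nested_loop_join(S, R, M=2, joins=lambda t, s: t[1] == s[0])))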
Two-Pass Algorithms
Based on Sorting
SECTION 15.4
Rupinder Singh
Two-Pass Algorithms Based on
Sorting

Two-pass algorithms: where data from the operand
relations is read into main memory, processed in
some way, written out to disk again, and then reread
from disk to complete the operation.

Basic idea

Step 1: read M blocks of R into main memory.
Step 2: sort these M blocks in main memory, using
an efficient main-memory sorting algorithm. We
expect that the time to sort will not exceed the disk
I/O time for step (1).
Step 3: write the sorted list into M blocks of disk.
Duplicate Elimination Using Sorting
δ(R)

First sort the tuples of R into sorted sublists.
Then use the available main memory to hold one
block from each sorted sublist.
Then repeatedly copy the first remaining tuple to the output
and ignore all tuples identical to it.
The total cost of this algorithm is 3B(R) disk I/O’s.
This algorithm requires only √B(R) blocks of main
memory, rather than B(R) blocks (one-pass
algorithm).
Example

Suppose that tuples are integers, and only two
tuples fit on a block. Also, M = 3 and the relation R
consists of 17 tuples:
2,5,2,1,2,2,4,5,4,3,4,2,1,5,2,1,3

After the first pass:

Sublist | Elements
R1      | 1,2,2,2,2,5
R2      | 2,3,4,4,4,5
R3      | 1,1,2,3,5
Example

Second pass:

Sublist | In memory | Waiting on disk
R1      | 1,2       | 2,2, 2,5
R2      | 2,3       | 4,4, 4,5
R3      | 1,1       | 2,3,5

Output: 1

After processing tuple 1:

Sublist | In memory | Waiting on disk
R1      | 2         | 2,2, 2,5
R2      | 2,3       | 4,4, 4,5
R3      | 2,3       | 5

Continue the same process with the next tuple.
Grouping and Aggregation Using
Sorting γ(R)

The two-pass algorithm for grouping and aggregation is quite similar to
the previous algorithm:
Step 1: read the tuples of R into memory, M blocks at a time. Sort
each M blocks, using the grouping attributes of L as the sort key.
Write each sorted sublist to disk.
Step 2: use one main-memory buffer for each sublist, and initially
load the first block of each sublist into its buffer.
Step 3: repeatedly find the least value of the sort key (grouping
attributes) present among the first available tuples in the buffers.
This algorithm takes 3B(R) disk I/O's, and will work as long as B(R) <
M².
A Sort-Based Union Algorithm

For bag union, the one-pass algorithm is used.

For set union:

Step 1: repeatedly bring M blocks of R into main memory, sort their tuples, and
write the resulting sorted sublist back to disk.
Step 2: do the same for S, to create sorted sublists for relation S.
Step 3: use one main-memory buffer for each sublist of R and S. Initialize each
with the first block from the corresponding sublist.
Step 4: repeatedly find the first remaining tuple t among all the buffers. Copy t
to the output, and remove from the buffers all copies of t (if R and S are sets,
there should be at most two copies).

This algorithm takes 3(B(R)+B(S)) disk I/O's, and will work as long as
B(R)+B(S) < M².
Sort-Based Intersection and
Difference

For both the set and bag versions, the algorithm is the same as that of
set union, except for the way we handle the copies of a tuple t
at the fronts of the sorted sublists:
For set intersection, output t if it appears in both R and S.
For bag intersection, output t the minimum of the number of
times it appears in R and in S.
For set difference, R−S, output t if and only if it appears in R but
not in S.
For bag difference, R−S, output t the number of times it appears
in R minus the number of times it appears in S.
A Simple Sort-Based Join Algorithm

When taking a join, the number of tuples from the two
relations that share a common value of the join
attribute(s), and therefore need to be in main memory
simultaneously, can exceed what fits in memory.
To avoid this situation, we try to reduce main-memory use
for other aspects of the algorithm, making available a
large number of buffers to hold the tuples with a given
join-attribute value.
A Simple Sort-Based Join Algorithm

Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main
memory for buffers:
Step 1: sort R and S, using a two-phase, multiway merge sort, with Y as the
sort key.
Step 2: merge the sorted R and S. The following steps are done repeatedly:

Find the least value y of the join attributes Y that is currently at the front of the
blocks for R and S.
If y does not appear at the front of the other relation, then remove the tuple(s)
with sort key y.
Otherwise, identify all the tuples from both relations having sort key y.
Output all the tuples that can be formed by joining tuples from R and S with the
common Y-value y.
If either relation has no more unconsidered tuples in main memory, reload the
buffer for that relation.
A Simple Sort-Based Join Algorithm

The simple sort-join uses 5(B(R) + B(S)) disk I/O's.
It requires B(R) ≤ M² and B(S) ≤ M² to work.
A More Efficient Sort-Based Join

If we do not have to worry about very large numbers of tuples with
a common value for the join attribute(s), then we can save two disk
I/O's per block by combining the second phase of the sorts with
the join itself.
To compute R(X, Y) ⋈ S(Y, Z) using M main-memory buffers:

Create sorted sublists of size M, using Y as the sort key, for both R and S.
Bring the first block of each sublist into a buffer.
Repeatedly find the least Y-value y among the first available tuples of all
the sublists. Identify all the tuples of both relations that have Y-value y.
Output the join of all tuples from R with all tuples from S that share this
common Y-value.
A More Efficient Sort-Based Join

The number of disk I/O’s is 3(B(R) + B(S))

It requires B(R) + B(S) ≤ M² to work
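A minimal in-memory sketch of the merge step in Python (the multiway sort phases are replaced by Python's sort, and sublist/buffer management is elided):

from operator import itemgetter

def sort_merge_join(R, S):
    """Merge-join R(X, Y) with S(Y, Z); Y is the 2nd attribute of R tuples
    and the 1st attribute of S tuples. sorted() stands in for the sort phase."""
    R = sorted(R, key=itemgetter(1))
    S = sorted(S, key=itemgetter(0))
    i = j = 0
    while i < len(R) and j < len(S):
        y_r, y_s = R[i][1], S[j][0]
        if y_r < y_s:
            i += 1                      # y_r cannot join; advance R
        elif y_r > y_s:
            j += 1                      # y_s cannot join; advance S
        else:
            # Identify all tuples of both relations with this Y-value.
            y, i2, j2 = y_r, i, j
            while i2 < len(R) and R[i2][1] == y:
                i2 += 1
            while j2 < len(S) and S[j2][0] == y:
                j2 += 1
            for x, _ in R[i:i2]:
                for _, z in S[j:j2]:
                    yield (x, y, z)
            i, j = i2, j2

R = [("a", 1), ("b", 2), ("c", 1)]
S = [(1, "p"), (2, "q")]
print(sorted(sort_merge_join(R, S)))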
Summary of Sort-Based Algorithms

Operators           | Approximate M required | Disk I/O
γ, δ                | √B                     | 3B
∪, ∩, −             | √(B(R) + B(S))         | 3(B(R) + B(S))
⋈ (simple)          | √(max(B(R), B(S)))     | 5(B(R) + B(S))
⋈ (more efficient)  | √(B(R) + B(S))         | 3(B(R) + B(S))
Two-Pass Algorithms Based on Hashing
(Section 15.5)
By
Swathi Vegesna
At a glimpse
 Introduction
 Partitioning Relations by Hashing
 Algorithm for Duplicate Elimination
 Grouping and Aggregation
 Union, Intersection, and Difference
 Hash-Join Algorithm
 Sort based Vs Hash based
 Summary
Introduction
Hashing is used if the data is too big to store in main-memory
buffers.
 Hash all the tuples of the argument(s) using an
appropriate hash key.
 For all the common operations, there is a way to select the
hash key so all the tuples that need to be considered
together when we perform the operation have the same
hash value.
 This reduces the size of the operand(s) by a factor equal
to the number of buckets.

Introduction (cont.)
 Example: if there are M buffers available, we can pick M
as the number of buckets, thus gaining a factor of M in
the size of the relations we can handle.
Partitioning Relations by Hashing
Algorithm:

initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO BEGIN
    read block b into the Mth buffer;
    FOR each tuple t in b DO BEGIN
        IF the buffer for bucket h(t) has no room for t THEN
        BEGIN
            copy the buffer to disk;
            initialize a new empty block in that buffer;
        END;
        copy t to the buffer for bucket h(t);
    END;
END;
FOR each bucket DO
    IF the buffer for this bucket is not empty THEN
        write the buffer to disk;
Duplicate Elimination
 For the operation δ(R) hash R to M-1 Buckets.
(Note that two copies of the same tuple t will hash to the same
bucket)
 Do duplicate elimination on each bucket Ri independently,
using one-pass algorithm
 The result is the union of δ(Ri), where Ri is the portion of R
that hashes to the ith bucket
Requirements
 Number of disk I/O's: 3*B(R)
 B(R) < M(M-1); only then will the two-pass, hash-based
algorithm work.
 In order for this to work, we need:
 the hash function h to evenly distribute the tuples among the
buckets
 each bucket Ri to fit in main memory (to allow the one-pass
algorithm)
 i.e., B(R) ≤ M²
Grouping and Aggregation
 Hash all the tuples of relation R to M-1 buckets, using a hash
function that depends only on the grouping attributes.
(Note: all tuples in the same group end up in the same
bucket.)
 Use the one-pass algorithm to process each bucket
independently.
 Uses 3*B(R) disk I/O's; requires B(R) ≤ M²
Union, Intersection, and Difference
 For a binary operation we use the same hash function to
hash the tuples of both arguments.
 For R ∪ S we hash both R and S to M-1 buckets each.
 For R ∩ S we hash both R and S to 2(M-1) buckets.
 For R − S we hash both R and S to 2(M-1) buckets.
 Requires 3(B(R)+B(S)) disk I/O’s.
 The two-pass, hash-based algorithm requires
min(B(R), B(S)) ≤ M²
Hash-Join Algorithm
 Use the same hash function for both relations; the hash
function should depend only on the join attributes.
 Hash R to M-1 buckets R1, R2, …, RM-1.
 Hash S to M-1 buckets S1, S2, …, SM-1.
 Do a one-pass join of Ri and Si, for all i.
 3*(B(R) + B(S)) disk I/O's; requires min(B(R), B(S)) ≤ M²
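A minimal Python sketch of the partition phase plus the per-bucket one-pass joins (partitions live in memory here; on a real system each bucket would be written to and reread from disk):

from collections import defaultdict

def hash_join(R, S, M):
    """Two-pass hash join of R(X, Y) with S(Y, Z): partition both relations
    on the join attribute Y with the same hash function, then do a one-pass
    join of each bucket pair. R tuples are (x, y); S tuples are (y, z)."""
    buckets = M - 1
    R_parts = [[] for _ in range(buckets)]
    S_parts = [[] for _ in range(buckets)]
    for x, y in R:
        R_parts[hash(y) % buckets].append((x, y))   # same hash function
    for y, z in S:
        S_parts[hash(y) % buckets].append((y, z))   # for both relations
    for i in range(buckets):
        # One-pass join of bucket pair (Ri, Si): build on Si, probe with Ri.
        table = defaultdict(list)
        for y, z in S_parts[i]:
            table[y].append(z)
        for x, y in R_parts[i]:
            for z in table[y]:
                yield (x, y, z)

R = [("a", 1), ("b", 2), ("c", 1)]
S = [(1, "p"), (2, "q")]
print(sorted(hash_join(R, S, M=3)))   # [('a', 1, 'p'), ('b', 2, 'q'), ('c', 1, 'p')]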
Sort-based vs. Hash-based
 For binary operations, hash-based only limits the memory
requirement to the min of the argument sizes, not the sum.
 Sort-based can produce output in sorted order, which
can be helpful.
 Hash-based depends on the buckets being of equal size.
 Sort-based algorithms can experience reduced
rotational latency or seek time.
Chapter 15.7
Buffer Management
ID: 219
Name: Qun Yu
Class: CS257 219 Spring 2009
Instructor: Dr. T.Y.Lin
What does a buffer manager do?
Assume there are M main-memory buffers needed for
the operators on relations to store needed data.
In practice:
1) buffers are rarely allocated in advance
2) the value of M may vary depending on system
conditions
Therefore, a buffer manager is used to allow processes to
get the memory they need, while minimizing the delay
and the unsatisfiable requests.
The role of the buffer manager

[Figure 1: the buffer manager responds to read/write requests for
main-memory access to disk blocks, mediating between the buffers
and disk]
15.7.1 Buffer Management Architecture
Two broad architectures for a buffer manager:
1) The buffer manager controls main memory directly.
• Relational DBMS
2) The buffer manager allocates buffers in virtual memory,
allowing the OS to decide how to use buffers.
• “main-memory” DBMS
• “object-oriented” DBMS
Buffer Pool
Key setting for the Buffer manager to be efficient:
The buffer manager should limit the number of buffers in
use so that they fit in the available main memory, i.e.
Don’t exceed available space.
The number of buffers is a parameter set when the DBMS is
initialized.
No matter which architecture of buffering is used, we simply
assume that there is a fixed-size buffer pool, a set of
buffers available to queries and other database actions.
Buffer Pool

[Figure: page requests from higher levels are served from the
buffer pool in main memory; a disk page is read into a free frame,
with the choice of frame dictated by the replacement policy]

 Data must be in RAM for the DBMS to operate on it!
 The buffer manager hides the fact that not all data is in RAM.
15.7.2 Buffer Management Strategies
Buffer-replacement strategies:
When a buffer is needed for a newly requested
block and the buffer pool is full, which block do we
throw out of the buffer pool?
Buffer-replacement strategy -- LRU
Least-Recently Used (LRU):
Throw out the block that has not been read or written
for the longest time.
• Requires more maintenance but is effective.
• Update the time table on every access.
• Least-recently used blocks are usually less likely to
be accessed soon than other blocks.
Buffer-replacement strategy -- FIFO
First-In-First-Out (FIFO):
The buffer that has been occupied the longest by the
same block is emptied and used for the new block.
• Requires less maintenance but can make more
mistakes.
• Keeps only the loading time.
• The oldest block isn’t necessarily less likely to be
accessed.
Example: the root block of a B-tree index
Buffer-replacement strategy – “Clock”
The “Clock” Algorithm (“Second Chance”)
Think of the 8 buffers as arranged in a circle, as shown in
Figure 3.
Flags 0 and 1:
buffers with a 0 flag are ok to send their contents
back to disk, i.e. ok to be replaced
buffers with a 1 flag are not ok to be replaced
Buffer-replacement strategy – “Clock”

[Figure 3: the clock algorithm — eight buffers arranged in a circle
with 0/1 flags; starting from the hand’s position, the first buffer
with a 0 flag is replaced. A 1 flag is set to 0 as the hand passes;
if the buffer’s contents are not accessed by the time the hand
reaches it again (flag still 0), the buffer is replaced. That’s the
“second chance”.]
Buffer-replacement strategy -- Clock
A buffer’s flag is set to 1 when:
 a block is read into the buffer
 the contents of the buffer are accessed
A buffer’s flag is set to 0 when:
 the buffer manager needs a buffer for a new block; it
looks for the first 0 it can find, rotating clockwise. If it
passes 1’s, it sets them to 0.
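A minimal Python sketch of the clock policy (the pool size and block names are illustrative):

class ClockPool:
    """Clock ("second chance") buffer replacement over n frames."""

    def __init__(self, n):
        self.blocks = [None] * n      # which block occupies each frame
        self.flags = [0] * n          # second-chance flag per frame
        self.hand = 0

    def access(self, block):
        if block in self.blocks:      # hit: mark the buffer as accessed
            self.flags[self.blocks.index(block)] = 1
            return
        # Miss: rotate the hand, clearing 1-flags, until a 0-flag is found.
        while self.flags[self.hand] == 1:
            self.flags[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.blocks)
        self.blocks[self.hand] = block    # replace this frame's block
        self.flags[self.hand] = 1         # newly read blocks get flag 1
        self.hand = (self.hand + 1) % len(self.blocks)

pool = ClockPool(3)
for b in ["A", "B", "C", "A", "D"]:   # "D" evicts the first unflagged block
    pool.access(b)
print(pool.blocks)                    # ['D', 'B', 'C']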
System control helps the buffer-replacement strategy

The query processor or other components of a DBMS can
give advice to the buffer manager in order to avoid some
of the mistakes that would occur with a strict policy such
as LRU, FIFO or Clock.
For example:
A “pinned” block means it can’t be moved to disk without
first modifying certain other blocks that point to it.
In FIFO, use “pinning” to force the root of a B-tree to remain
in memory at all times.
15.7.3 The Relationship Between Physical
Operator Selection and Buffer Management
Problem:
A physical operator expects a certain number of
buffers M for execution.
However, the buffer manager may not be able
to guarantee that these M buffers are available.

Questions:
Can the algorithm adapt to changes in M, the
number of main-memory buffers available?
When the available buffers are fewer than M, and some
blocks have to be put on disk instead of in memory,
how does the buffer-replacement strategy impact the
performance (i.e. the number of additional I/O’s)?
Example
FOR each chunk of M-1 blocks of S DO BEGIN
    read these blocks into main-memory buffers;
    organize their tuples into a search structure whose
        search key is the common attributes of R and S;
    FOR each block b of R DO BEGIN
        read b into main memory;
        FOR each tuple t of b DO BEGIN
            find the tuples of S in main memory that
                join with t;
            output the join of t with each of these tuples;
        END;
    END;
END;
Figure 15.8: The nested-loop join algorithm
Example
The number of blocks per outer-loop chunk (M-1) depends on the
average number of buffers available at each iteration.
The outer loop uses M-1 buffers and 1 is reserved for a block
of R, the relation of the inner loop.
If we pin the M-1 blocks we use for S on one iteration of the
outer loop, we shall not lose their buffers during the round.
Also, more buffers may become available, and then we could
keep more than one block of R in memory.
Will these extra buffers improve the running time?
Example
CASE 1: NO
Buffer-replacement strategy: LRU
Buffers for R: k
We read each block of R in order into the buffers.
By the end of an iteration of the outer loop, the last k blocks of R
are in the buffers.
However, the next iteration will start from the beginning of R
again.
Therefore, the k buffers for R will all need to be replaced.
Example
CASE 2: YES
Buffer-replacement strategy: LRU
Buffers for R: k
We read the blocks of R in an order that alternates:
first→last and then last→first.
In this way, we save k disk I/O's on each iteration of the outer
loop except the first iteration.
Other Algorithms and M buffers
Other algorithms are also impacted by M and the
buffer-replacement strategy.
Sort-based algorithms:
If M shrinks, we can change the size of a sublist.
Unexpected result: too many sublists to allocate
each sublist a buffer.
Hash-based algorithms:
If M shrinks, we can reduce the number of
buckets, as long as the buckets still fit in M
buffers.
Chapter 15
Query Execution
15.8 Algorithms using more than
two passes
Presented by: Kai Zhu
Professor: Dr. T.Y. Lin
Class ID: 220
Contents



Use of more than 2 pass Algorithms
Multi-pass Sort-based Algorithms
Conclusion
Use of more than 2 pass Algorithms
Two passes are usually enough, but for very large
relations, by using as many passes as necessary,
we can process a relation of arbitrary size.
Multi-pass Sort-based Algorithms
The two-phase multiway merge sort can be
extended to a three-pass (or more) algorithm.
Suppose we have M main-memory buffers
available to sort a relation R, which we assume
is stored clustered.
Then we do the following:

BASIS:
If R fits in M blocks (i.e., B(R) ≤ M):
1. Read R into main memory.
2. Sort it using any main-memory
sorting algorithm.
3. Write the sorted relation to disk.

INDUCTION:
If R does not fit into main memory:
1. Partition the blocks holding R into M
groups, which we shall call R1, R2, …, RM.
2. Recursively sort Ri for each
i = 1, 2, …, M.
3. Merge the M sorted sublists.
If we are not merely sorting R, but performing
a unary operation such as δ or γ on R,
we can modify the above so that at the final
merge we perform the operation on the tuples
at the front of the sorted sublists.
That is:

For a δ, output one copy of each distinct tuple,
and skip over copies of the tuple.
For a γ, sort on the grouping attributes only,
and combine the tuples with a given value of
these grouping attributes.
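A minimal Python sketch of the basis/induction structure (blocks are modeled as fixed-size runs of tuples; M, the block size, and the data are illustrative):

from heapq import merge

BLOCK = 2  # tuples per block, for illustration

def multipass_sort(tuples, M):
    """Recursive multiway merge sort: sort in memory if R fits in M blocks,
    otherwise partition into M groups, sort each recursively, and merge."""
    blocks_needed = -(-len(tuples) // BLOCK)   # ceiling of B(R)
    if blocks_needed <= M:                     # BASIS: R fits in M blocks
        return sorted(tuples)
    # INDUCTION: partition R into M groups R1..RM and recurse on each.
    group_size = -(-len(tuples) // M)
    groups = [tuples[i:i + group_size] for i in range(0, len(tuples), group_size)]
    sorted_sublists = [multipass_sort(g, M) for g in groups]
    return list(merge(*sorted_sublists))       # merge the M sorted sublists

R = [2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1, 3]
print(multipass_sort(R, M=3))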
Conclusion
The two pass algorithms based on sorting or
hashing have natural recursive analogs that
take three or more passes and will work for
larger amounts of data.
Thank You