SECONDARY STORAGE MANAGEMENT
CHAPTER 13

SUBMITTED BY
KHADKE, SUVARNA
CS 257 (SECTION II) ID 213
STUDENT ID: 005226235
13.1.1 Memory Hierarchy

 Several components for data storage are available, with different data capacities.
 The cost per byte of storage also varies from device to device.
 The devices with the smallest capacity offer the fastest access speed, at the highest cost per bit.

[Memory hierarchy diagram: Cache, Main Memory, Disk, and Tertiary Storage, from fastest to slowest; programs and the DBMS work in main memory, with the file system on disk serving as virtual memory.]
13.1.1 Memory Hierarchy (cont'd)

Cache
 The lowest level of the hierarchy (closest to the processor).
 Data items in the cache are copies of certain locations of main memory.
 Sometimes values in the cache are changed, and the corresponding changes to main memory are delayed.
 The machine looks in the cache for instructions, as well as for the data those instructions use.
 Holds a limited amount of data.
 Level-1 cache is found on the same chip as the microprocessor itself; additional level-2 cache is found on another chip.
 Data and instructions are moved to the cache from main memory when they are needed by the processor.
 Cache data can be accessed by the processor in a few nanoseconds.
 In a single-processor computer there is no need to update data in main memory immediately.
 With multiple processors, data is written to main memory immediately; this is called write-through.
Main Memory
 In the center of the action is the computer's main memory: we may think of everything that happens in the computer, instruction executions and data manipulations, as working on information that is resident in main memory.
 Main memories are random access: one can obtain any byte in the same amount of time.
 Typical times to access data from main memory to the processor or cache are in the 10-100 nanosecond range.
Secondary Storage
 Used to store data and programs when they are not being processed.
 More permanent than main memory: data and programs are retained when the power is turned off.
 Examples: magnetic disks, hard disks.
 Database systems always involve secondary storage: disks and other devices that store large amounts of data that persist over time.
Tertiary Storage
 As capacious as a collection of disk units can be, there are databases much larger than what can be stored on the disk(s) of a single machine, or even of a substantial collection of machines.
 Tertiary storage devices have been developed to hold data volumes measured in terabytes.
13.1.2 Transfer of Data Between Levels
 Data moves between adjacent levels of the hierarchy.
 The disk is organized into blocks; entire blocks are moved to or from a main-memory area called a buffer.
 At the secondary and tertiary levels, accessing the desired data or finding the desired place to store data takes a great deal of time, so each level is organized to transfer large amounts of data to or from the level below whenever any data at all is needed.
13.1.2 Transfer of Data Between Levels (cont'd)
 A key technique for speeding up database operations is to arrange the data so that when one piece of data on a block is needed, it is likely that other data on the same block will be needed at the same time.
 The same idea applies to the other hierarchy levels.
13.1.3 Volatile and Non-Volatile Storage
 A volatile device "forgets" what data is stored on it when the power goes off.
 A non-volatile device holds its data even when the device is turned off.
 All secondary and tertiary storage devices are non-volatile; main memory is volatile.
13.1.4 Virtual Memory
 When we write programs, the data we use (variables of the program, files read, and so on) occupies a virtual-memory address space.
 Typical software executes in virtual memory.
 The address space is typically 32 bits, i.e., 2^32 bytes or 4 GB.
 Transfer between memory and disk is in terms of blocks.
13.2.1 Mechanisms of Disks
 The use of secondary storage is one of the important characteristics of a DBMS.
 The two principal moving pieces of a disk drive are the disk assembly and the head assembly.
 The disk assembly consists of one or more circular platters that rotate around a central spindle.
 The upper and lower surfaces of the platters are covered with a thin layer of magnetic material, on which bits are stored.
13.2.1 Mechanisms of Disks (cont'd)
 0's and 1's are represented by different patterns in the magnetic material.
 A common diameter for disk platters is 3.5 inches.
 The disk is organized into tracks, which are concentric circles on a single platter.
 The tracks that are at a fixed radius from the center, among all the surfaces, form one cylinder.
 Tracks are organized into sectors, which are segments of the circle separated by gaps.
13.2.2 Disk Controller
 One or more disks are controlled by a disk controller.
 Disk controllers are capable of:
  Controlling the mechanical actuator that moves the head assembly.
  Selecting a sector from among all those in the cylinder at which the heads are positioned.
  Transferring bits between the desired sector and main memory.
  Possibly buffering an entire track.
13.2.3 Disk Access Characteristics
 Accessing (reading/writing) a block requires three steps:
  The disk controller positions the head assembly at the cylinder containing the track on which the block is located. This delay is the "seek time".
  The disk controller waits while the first sector of the block moves under the head. This is the "rotational latency".
  All the sectors of the block and the gaps between them pass under the head while the disk controller reads or writes the data. This is the "transfer time".
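The three delays above simply add. As a rough sketch (the rotation speed and sector counts below are illustrative assumptions, not values from the chapter):

```python
def block_access_time(seek_ms, rpm, sectors_per_track, sectors_per_block):
    """Seek time + average rotational latency + transfer time, in ms.

    Gaps between sectors are ignored for simplicity.
    """
    rotation_ms = 60_000.0 / rpm                 # one full rotation
    rotational_latency = rotation_ms / 2         # on average, half a rotation
    transfer = rotation_ms * sectors_per_block / sectors_per_track
    return seek_ms + rotational_latency + transfer

# e.g. on a 7200 RPM disk, one rotation takes about 8.33 ms
```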
13.3 Accelerating Access to Secondary Storage

 Several approaches for more efficiently accessing data in secondary storage:
  Place blocks that are accessed together in the same cylinder.
  Divide the data among multiple disks.
  Mirror disks.
  Use disk-scheduling algorithms.
  Prefetch blocks into main memory.
 Scheduling latency: the added delay in accessing data caused by a disk-scheduling algorithm.
 Throughput: the number of disk accesses per second that the system can accommodate.
13.3.1 The I/O Model of Computation

 The number of block accesses (disk I/O's) is a good approximation to the time needed by an algorithm, and should be minimized.
 Ex 13.3: We want an index on relation R that identifies the block on which the desired tuple appears, but not where on the block it resides.
 For the Megatron 747 (M747) example, it takes 11 ms to read a 16 KB block.
  A standard microprocessor can execute millions of instructions in 11 ms, making any delay in searching the block for the desired tuple negligible.
13.3.2 Organizing Data by Cylinders

 If we read all blocks on a single track or cylinder consecutively, we can neglect all but the first seek time and the first rotational latency.
 Ex 13.4: We request 1024 blocks of the M747.
  If the data is randomly distributed, the average latency is 10.76 ms by Ex 13.2, making the total latency about 11 s.
  If all blocks are stored consecutively on one cylinder:
   6.46 ms + 8.33 ms * 16 = 139 ms
   (1 average seek) + (time per rotation) * (# rotations)
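The arithmetic in Ex 13.4 can be checked directly (the figures are the slide's M747 numbers):

```python
# Random placement: every one of the 1024 blocks pays the average latency.
random_total_ms = 1024 * 10.76        # roughly 11,000 ms, i.e. about 11 s

# Consecutive placement on one cylinder: one average seek, then 16 rotations.
consecutive_ms = 6.46 + 8.33 * 16     # roughly 139.7 ms

speedup = random_total_ms / consecutive_ms   # nearly two orders of magnitude
```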
13.3.3 Using Multiple Disks

 If we have n disks, read/write performance increases by a factor of n.
 Striping: distributing a relation across multiple disks following this pattern:
  Data on disk 1: R1, R1+n, R1+2n, ...
  Data on disk 2: R2, R2+n, R2+2n, ...
  ...
  Data on disk n: Rn, Rn+n, Rn+2n, ...
 Ex 13.5: We request 1024 blocks with n = 4.
  6.46 ms + 8.33 ms * (16/4) = 39.8 ms
  (1 average seek) + (time per rotation) * (# rotations)
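The striping pattern above can be expressed as a small mapping (1-indexed block numbers; the return-tuple layout is an illustrative choice):

```python
def striped_location(block, n_disks):
    """Map block number `block` (1-indexed) to (disk, position on that disk)
    under round-robin striping across `n_disks` disks."""
    disk = (block - 1) % n_disks + 1
    position = (block - 1) // n_disks
    return disk, position
```

With n = 4, blocks 1, 5, 9, ... land on disk 1, matching the pattern on the slide.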
13.3.4 Mirroring Disks

 Mirroring disks: having two or more disks hold identical copies of the data.
 Benefit 1: if n disks are mirrors of each other, the system can survive crashes of up to n-1 of them.
 Benefit 2: with n disks, read performance increases by a factor of n.
 Performance increases further by having the controller select, for each read, the disk whose head is closest to the desired data block.
13.3.5 Disk Scheduling and the Elevator Problem

 The disk controller runs this algorithm to select which of several pending requests to process first.
 Pseudocode:

    requests[]   // list of all unprocessed data requests
    upon receiving a new data request:
        requests[].add(new request)
    while requests[] is not empty:
        move head to next location
        if head location is at data in requests[]:
            retrieve data
            remove data from requests[]
        if head reaches end:
            reverse head direction
13.3.5 Disk Scheduling and the Elevator Problem (cont'd)

Events (cylinders range from 8000 to 64000; the head starts at the low end and sweeps upward):
 Head starting point
 Request data at 8000
 Request data at 24000
 Request data at 56000
 Get data at 8000
 Request data at 16000
 Get data at 24000
 Request data at 64000
 Get data at 56000
 Request data at 40000
 Get data at 64000
 Get data at 40000
 Get data at 16000

Resulting schedule:
 data      time
 8000..    4.3
 24000..   13.6
 56000..   26.9
 64000..   34.2
 40000..   45.5
 16000..   56.8
13.3.5 Disk Scheduling and the Elevator Problem (cont'd)

 Elevator algorithm         FIFO algorithm
 data      time             data      time
 8000..    4.3              8000..    4.3
 24000..   13.6             24000..   13.6
 56000..   26.9             56000..   26.9
 64000..   34.2             16000..   42.2
 40000..   45.5             64000..   59.5
 16000..   56.8             40000..   70.8
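A minimal sketch of the elevator idea: it services a set of pending cylinder requests in one upward sweep, then reverses. Dynamically arriving requests, as in the timeline above, are omitted for brevity (an assumption of this sketch):

```python
def elevator_order(head, pending):
    """Order in which the elevator algorithm visits pending cylinder requests,
    sweeping upward from `head`, then reversing direction at the end."""
    cylinders = sorted(pending)
    upward = [c for c in cylinders if c >= head]
    downward = [c for c in cylinders if c < head]
    return upward + list(reversed(downward))
```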
13.3.6 Prefetching and Large-Scale Buffering

 If, at the application level, we can predict the order in which blocks will be requested, we can load them into main memory before they are needed.
13.4 Disk Failures
Ways in which disks can fail:
 Intermittent failure
 Media decay
 Write failure
 Disk crash

Intermittent Failures
 A read or write operation on a sector succeeds not on the first try, but only after repeated tries.
 This is the most common form of failure.
 Parity checks can be used to detect this kind of failure.
 A straightforward way to perform the check is to read the sector back and compare it with the sector we intended to write.
Media Decay
 A serious form of failure: one or more bits are permanently corrupted.
 It is impossible to read the sector correctly, even after many tries.
 The stable-storage technique for organizing a disk can be used so that this failure, or a failed write, does not result in permanent loss.
Write Failure
 An attempt to write a sector is not possible, and an attempt to retrieve the previously written sector is unsuccessful.
 A possible reason: a power outage while the sector was being written.
 The stable-storage technique can be used to avoid this.
Disk Crash
 The most serious form of disk failure.
 The entire disk becomes unreadable, suddenly and permanently.
 RAID techniques can be used to cope with disk crashes.
More on Intermittent Failures
 An intermittent read failure occurs when we try to read a sector, but the correct contents of that sector are not delivered to the disk controller.
 If the controller has a way to tell whether the sector read is good or bad (checksums), it can reissue the read request when bad data is read, until the sector is returned correctly or some preset limit, like 100 tries, is reached.
More on Intermittent Failures (cont'd)
 The controller can attempt to write a sector, but the contents actually written may not be what was intended.
 The only way to check the write is to let the disk go around again and read the sector.
 One way to perform the check is to read the sector and compare it with the sector we intended to write.
More on Intermittent Failures (cont'd)
 Instead of performing the complete comparison at the disk controller, a simpler way is to read the sector and see whether a good sector was read.
 If it is a good sector, then the write was correct; otherwise the write was unsuccessful and must be repeated.
Checksums
 A technique used to determine the good/bad status of a sector.
 Each sector has some additional bits, called the checksum, that are set depending on the values of the data bits in that sector.
 If the checksum is not proper on reading, then there was an error in the read.
Checksums (cont'd)
 If on reading we find that the checksum is not proper for the data bits, then we know there was an error in reading.
 There is a small chance that the block was not read correctly even though the checksum is proper.
 The probability of catching a bad read can be increased by using many checksum bits.
Checksum Calculation
 The simplest checksum is based on the parity of all bits in the sector.
 If there is an odd number of 1's among a collection of bits, the bits are said to have odd parity; a parity bit of 1 is added.
 If there is an even number of 1's, the collection of bits is said to have even parity; a parity bit of 0 is added.
Checksum Calculation (cont'd)
 As a result, the number of 1's among a collection of bits and their parity bit is always even.
 During a write operation, the disk controller calculates the parity bit and appends it to the sequence of bits written in the sector.
 Every sector then has even parity.
Examples
 The sequence 01101000 has an odd number of 1's, so its parity bit is 1, and the sequence with its parity bit is 011010001.
 The sequence 11101110 has an even number of 1's, so its parity bit is 0, and the sequence with its parity bit is 111011100.
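The parity computation above, as a short sketch (bit sequences represented as strings):

```python
def parity_bit(bits):
    """'1' if the sequence has an odd number of 1's, else '0'."""
    return '1' if bits.count('1') % 2 == 1 else '0'

def with_parity(bits):
    """Append the parity bit so the stored sequence always has even parity."""
    return bits + parity_bit(bits)
```

Flipping any single bit of the stored sequence leaves it with odd parity, which is how the controller detects a one-bit error.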
Checksum Calculation (cont'd)
 Any one-bit error in reading or writing the bits results in a sequence of bits that has odd parity.
 The disk controller counts the number of 1's; if the sector has odd parity, there must have been an error.
Odds
 There is a chance that more than one bit is corrupted and the error goes unnoticed.
 Increasing the number of parity bits increases the chance of detecting errors.
 In general, if there are n independent bits serving as the checksum, the chance of missing an error is only one in 2^n.
Stable Storage
 While checksums will almost certainly detect the existence of a media failure or a failure to read or write correctly, they do not help us correct the error: checksums can detect an error but cannot correct it.
 Sometimes we overwrite the previous contents of a sector and yet cannot read the new contents correctly.
 To deal with these problems, a stable-storage policy can be implemented on the disks.
Stable Storage (cont'd)
 Sectors are paired, and each pair represents one sector-contents X.
 The left copy of the sector is referred to as X_L, and the right copy as X_R.
Assumptions
 We assume that the copies are written with a sufficient number of parity bits that the chance of a bad sector looking good when the parity checks are considered is negligible.
 Also, if the read function returns a good value w for either X_L or X_R, then w is assumed to be the true value of X.
Stable-Storage Writing Policy
 1. Write the value of X into X_L. Check that the value has status "good", i.e., that the parity-check bits are correct in the written copy. If not, repeat the write. If after a set number of write attempts we have not successfully written X into X_L, assume that there is a media failure in this sector; a fix-up such as substituting a spare sector for X_L must be adopted.
 2. Repeat (1) for X_R.
Stable-Storage Reading Policy
 Alternate trying to read X_L and X_R until a good value is returned.
 If a good value is not returned after a pre-chosen number of tries, then X is assumed to be truly unreadable.
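A toy model of the two policies above (the in-memory "sectors", the injected write failures, and the retry limit are all illustrative assumptions):

```python
class StableSector:
    """Sector-contents X kept as a left/right pair (X_L, X_R)."""

    def __init__(self, retries=3):
        self.copies = {"L": None, "R": None}    # side -> (value, is_good)
        self.retries = retries

    def _write_copy(self, side, value, write_ok):
        for _ in range(self.retries):
            if write_ok():                      # simulated parity check passes?
                self.copies[side] = (value, True)
                return True
        self.copies[side] = (value, False)      # media failure on this copy
        return False

    def write(self, value, write_ok=lambda: True):
        ok_left = self._write_copy("L", value, write_ok)
        ok_right = self._write_copy("R", value, write_ok)
        return ok_left or ok_right              # X survives if either copy is good

    def read(self):
        for side in ("L", "R"):                 # try copies until one is good
            copy = self.copies[side]
            if copy is not None and copy[1]:
                return copy[0]
        return None                             # X is truly unreadable
```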
Error-Handling Capabilities
Media failures:
 If, after storing X in sectors X_L and X_R, one of them undergoes a media failure and becomes permanently unreadable, we can always read X from the other.
 Only if both sectors have failed can X not be read.
 The probability of both copies failing is extremely low.
Error-Handling Capabilities (cont'd)
Write failure:
 When writing X, if there is a system failure (like a power outage), the copy of X in main memory is lost and the copy of X being written at that moment may be erroneous.
 Half of the sector may be written with part of the new value of X, while the other half remains as it was.
Error-Handling Capabilities (cont'd)
The possible cases when the system becomes available again:
 1. The failure occurred while writing X_L. Then X_L is considered bad. Since X_R was never changed, its status is good, and we can copy X_R into X_L; X then holds its old value.
 2. The failure occurred after X_L was written. Then X_L has good status, and X_R, which still holds the old value, has bad status. We can copy the new value of X from X_L into X_R.
Recovery from Disk Crashes
 To reduce data loss from disk crashes, schemes that involve redundancy, extending the idea of parity checks or duplicated sectors, can be applied.
 The term used for these strategies is RAID: Redundant Arrays of Independent Disks.
 In general, if the mean time to failure of disks is n years, then in any given year 1/n of the surviving disks fail.
Recovery from Disk Crashes (cont'd)
 Each of the RAID schemes has data disks and redundant disks.
 Data disks: one or more disks that hold the data.
 Redundant disks: one or more disks that hold information completely determined by the contents of the data disks.
 When a disk crashes, whether a data disk or a redundant disk, the other disks can be used to restore it, avoiding permanent information loss.
13.4.6 DISK FAILURES
Xiaqing He
ID: 204
Dr. Lin
1) Mirroring
 The simplest scheme for recovery from disk crashes.
 How does mirroring work? By making two or more copies of the data on different disks.
 Benefits:
  Saves the data in case one disk fails.
  Divides reads among several disks, allowing access to several blocks at once.
1) Mirroring (cont'd)
 For mirroring, when can data be lost? The only way data can be lost is if the second (mirror) disk crashes while the first (data) disk crash is being repaired.
 Probability estimate. Suppose:
  One disk: mean time to failure = 10 years;
  The first failure among the two disks therefore occurs, on average, after 5 years;
  The process of replacing the failed disk takes 3 hours = 1/2920 of a year.
 So:
  The chance that the mirror disk fails during the replacement = 1/10 * 1/2,920 = 1/29,200;
  The chance of data loss by mirroring in a given year = 1/5 * 1/29,200 = 1/146,000.
2) Parity Blocks
 Why a change? The disadvantage of mirroring is that it uses so many redundant disks.
 What's new? RAID level 4 uses only one redundant disk.
 How does this one redundant disk work? Modulo-2 sums: the jth bit of the redundant disk is the modulo-2 sum (XOR) of the jth bits of all the data disks.

2) Parity Blocks (cont'd): Example
Data disks:
 Disk 1: 11110000
 Disk 2: 10101010
 Disk 3: 00111000
Redundant disk:
 Disk 4: 01100010
2) RAID 4 (cont'd)
 Reading: similar to reading a block from any single disk.
 Writing:
  1) Change the data disk;
  2) Change the corresponding block of the redundant disk.
 Why? The redundant disk must continue to hold the parity checks for the corresponding blocks of all the data disks.
2) RAID 4 (cont'd): Writing
For a total of N data disks:
 1) Naive way:
  Read the corresponding blocks of all N data disks and compute their modulo-2 sum;
  Rewrite the block of the redundant disk with that modulo-2 sum.
 2) Better way:
  Take the modulo-2 sum of the old and new versions of the data block that was rewritten;
  Flip the bits of the redundant disk's block in exactly the positions that are 1 in that modulo-2 sum.
2) RAID 4 (cont'd): Writing Example
Data disks:
 Disk 1: 11110000
 Disk 2: 10101010 -> 01100110
 Disk 3: 00111000
To do:
 Modulo-2 sum of the old and new versions of disk 2: 11001100.
 So we need to flip positions 1, 2, 5, and 6 of the redundant disk.
Redundant disk:
 Disk 4: 01100010 -> 10101110
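The "better way" above is just two XORs, sketched here with bit strings (the helper names are mine):

```python
def xor_bits(a, b):
    """Bitwise modulo-2 sum of two equal-length bit strings."""
    return ''.join('1' if x != y else '0' for x, y in zip(a, b))

def update_parity(parity, old_block, new_block):
    """RAID 4 write: flip parity bits exactly where old and new data differ."""
    return xor_bits(parity, xor_bits(old_block, new_block))
```

For the example above, xor_bits('10101010', '01100110') gives 11001100, and applying it to the redundant block 01100010 yields 10101110.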
2) RAID 4 (cont'd): Failure Recovery
 Redundant disk crash: swap in a new disk and recompute its contents from all the data disks.
 Data disk crash: swap in a new disk and recompute its data from the other disks, including the remaining data disks and the redundant disk.
 How do we recompute? Take the modulo-2 sum of all the corresponding bits of all the other disks. (It is the same rule either way, which is what makes the improvement below possible.)
3) An Improvement: RAID 5
 Why is an improvement needed? RAID level 4 suffers from a bottleneck: every update to a data disk must also read and write the single redundant disk.
 Principle of RAID level 5 (RAID 5): treat each disk as the redundant disk for some of the blocks.
 Why is this feasible? The failure-recovery rule for a redundant disk and for a data disk is the same: "take the modulo-2 sum of all the corresponding bits of all the other disks". So there is no need to fix one disk as redundant and the others as data disks.
3) RAID 5 (cont'd)
 How do we recognize which blocks of each disk treat that disk as the redundant disk?
 If there are n+1 disks, labeled 0 to n, then we can treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1.
 Example below.
3) RAID 5 (cont'd): Example
 n = 3 (four disks, labeled 0-3):
  Disk 0 is redundant for cylinders 4, 8, 12, ...;
  Disk 1 is redundant for cylinders 1, 5, 9, ...;
  Disk 2 is redundant for cylinders 2, 6, 10, ...;
  and so on.
 Suppose all 4 disks are equally likely to be written. For any one of the 4 disks, the probability that a given write involves it is 1/4 + 3/4 * 1/3 = 1/2.
 With m disks in general: 1/m + (m-1)/m * 1/(m-1) = 2/m.
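The cylinder-to-redundant-disk rule above is one line of code (disks labeled 0 to n, so `total_disks` is n+1):

```python
def redundant_disk(cylinder, total_disks):
    """Disk that stores the redundant (parity) copy of this cylinder under
    the RAID 5 rotation rule: the remainder of cylinder / (n+1)."""
    return cylinder % total_disks
```

With four disks, cylinders 4, 8, 12, ... map to disk 0 and cylinders 1, 5, 9, ... to disk 1, matching the example.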
4) Coping with Multiple Disk Crashes
 RAID 6: deals with any number of disk crashes if enough redundant disks are used.
 Example: a system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7).
 How do we set up the 3x7 matrix? (Why 3 rows? Because there are 3 redundant disks.)
  1) Every column holds some mix of three 1's and 0's, except that a column of all 0's is not allowed;
  2) The column for a redundant disk has a single 1;
  3) The column for a data disk has at least two 1's.
4) Coping with Multiple Disk Crashes (cont'd)
 Reading: read from the data disks and ignore the redundant disks.
 Writing:
  Change the data disk;
  Change the corresponding bits of all the redundant disks.
4) Coping with Multiple Disk Crashes (cont'd)
 In this system of 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected?
 Suppose disks a and b have failed:
  Find some row r of the 3x7 matrix in which the columns for a and b differ (say a has 0 and b has 1 in row r);
  Compute the correct b by taking the modulo-2 sum of the corresponding bits from all the disks other than b that have 1 in row r;
  After restoring b, compute the correct a using all the other disks, which are now available.
 Example below.
4) Coping with Multiple Disk Crashes (cont'd): Example
The 3x7 matrix (disks 1-4 are data disks; disks 5-7 are redundant disks):

 disk number:  1  2  3  4  5  6  7
               1  1  1  0  1  0  0
               1  1  0  1  0  1  0
               1  0  1  1  0  0  1
4) Coping with Multiple Disk Crashes (cont'd): Example
First blocks of all the disks:

 disk   contents
 1)     11110000
 2)     10101010
 3)     00111000
 4)     01000001
 5)     01100010
 6)     00011011
 7)     10001001
4) Coping with Multiple Disk Crashes (cont'd): Example
Disks 2 and 5 crash:

 disk   contents
 1)     11110000
 2)     ????????
 3)     00111000
 4)     01000001
 5)     ????????
 6)     00011011
 7)     10001001
4) Coping with Multiple Disk Crashes (cont'd): Example
In the 3x7 matrix, find a row in which disks 2 and 5 have different values: in row 2, disk 2's value is 1 and disk 5's value is 0.
 So: compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4, and 6, the other disks with a 1 in row 2;
 Then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2, and 3 (row 1).

 1)  11110000
 2)  ???????? => 10101010
 3)  00111000
 4)  01000001
 5)  ???????? => 01100010
 6)  00011011
 7)  10001001
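The two-step recovery can be replayed directly with modulo-2 sums (blocks as bit strings; the values are the first blocks listed in the example):

```python
from functools import reduce

def xor(*blocks):
    """Bitwise modulo-2 sum of any number of equal-length bit strings."""
    pair = lambda a, b: ''.join('1' if x != y else '0' for x, y in zip(a, b))
    return reduce(pair, blocks)

# Row 2 of the matrix has 1's for disks 1, 2, 4, 6, and disk 5's column has a 0
# there, so crashed disk 2 is the sum of the surviving row-2 disks 1, 4, 6:
disk2 = xor('11110000', '01000001', '00011011')
# Row 1 has 1's for disks 1, 2, 3, 5; with disk 2 restored, disk 5 follows:
disk5 = xor('11110000', disk2, '00111000')
```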
13.5 Arranging Data on Disk

Records
 Data elements are represented as records, which are stored in consecutive bytes within the same disk block.
 A basic layout technique for storing data is the fixed-length record.
 Allocation criterion: data should start at a word boundary.
 A fixed-length record header contains:
  1. A pointer to the record schema.
  2. The length of the record.
  3. Timestamps indicating the last modification or last read.
Example

CREATE TABLE employee(
  name CHAR(30) PRIMARY KEY,
  address VARCHAR(255),
  gender CHAR(1),
  birthdate DATE
);

The following information should be in the record:
 1. The record schema
 2. The length of the record
 3. Timestamps
Many record layouts include a header of some small number of bytes to provide this additional information.
The record should start at a word boundary and contain the header followed by the four fields: name, address, gender, and birthdate.
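One way the fixed-length employee record above might be laid out, sketched with Python's struct module (the 12-byte header layout, the dummy schema pointer, and DATE as a 10-byte string are illustrative assumptions, not the book's format):

```python
import struct

HEADER_FMT = '<III'          # schema pointer, record length, timestamp
BODY_FMT = '30s255s1s10s'    # name, address, gender, birthdate (fixed sizes)

def pack_employee(name, address, gender, birthdate, schema_ptr=0, timestamp=0):
    """Pack one fixed-length record, padded so its length is a multiple of 4
    (records then always start on a word boundary)."""
    body = struct.pack(BODY_FMT, name.encode(), address.encode(),
                       gender.encode(), birthdate.encode())
    length = struct.calcsize(HEADER_FMT) + len(body)
    length += (-length) % 4  # round up to a word boundary
    record = struct.pack(HEADER_FMT, schema_ptr, length, timestamp) + body
    return record + b'\x00' * (length - len(record))
```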
Packing Fixed-Length Records into
Blocks :
93
Records are stored in the form of blocks on the disk and
they move into main memory when we need to update
or access them.
A block header is written first, and it is followed by a series of records.
Block header contains the following
information :
94
 Links to one or more blocks that are part of a network of blocks.
 Information about the role played by this block in such a network.
 Information about the relation the tuples in this block belong to.
 A "directory" giving the offset of each record in the block.
 Timestamp(s) to indicate the time of the block's last modification and/or access.
95
Example
Along with the header, we can pack as many records as we can into one block, as shown in the figure; the remaining space is unused.
96
13.6 REPRESENTING
BLOCK AND RECORD
ADDRESSES
97
 Address of a Block and a Record
  In Main Memory
   Address of the block is the virtual memory address of the first byte
   Address of the record within the block is the virtual memory address of the first byte of the record
  In Secondary Memory: a sequence of bytes describes the location of the block in the overall system
   The sequence of bytes describes the location of the block: the device ID for the disk, the cylinder number, etc.
Addresses in Client-Server
Systems
98

Database consists of a server process that provides data from secondary storage
to one or more client processes that are applications using the data.

The server and client processes may be on one machine, or the server and the
various clients can be distributed over many machines.

The client application uses a "virtual" address space.

The operating system or DBMS decides which parts of the address space are
currently located in main memory, and hardware maps the virtual address space
to physical locations in main memory.
 The addresses in the address space are represented in two ways:
  Physical Addresses: byte strings that determine the place within the secondary storage system where the record can be found.
  Logical Addresses: arbitrary strings of bytes of some fixed length
Physical Address bits are used to indicate:

Host to which the storage is attached

Identifier for the disk

Number of the cylinder

Number of the track

Offset of the beginning of the record
ADDRESSES IN CLIENT-SERVER
SYSTEMS (CONTD..)
99

Map Table relates logical addresses to
physical addresses.
[Figure: the map table has columns Logical and Physical; each logical address maps to a physical address]
Logical and Structured
Addresses
100

Purpose of logical address?

All the information needed for a physical address is found in the map table.

Many combinations of logical and physical addresses yield structured address
schemes.

A very useful combination of physical and logical addresses is to keep in each block an offset table that holds the offsets of the records within the block, as suggested in the figure.
 Gives more flexibility when we:
  move the record around within the block
  move the record to another block
 Gives us the option of deciding what to do when a record is deleted
[Figure: a block whose header contains an offset table pointing to Record 1, Record 2, Record 3, and Record 4, followed by unused space]
Pointer Swizzling
101
 Relational systems need the ability to represent pointers in tuples
 Index structures are composed of blocks that usually have pointers within them
 Having pointers is common in object-relational database systems
 Important to learn about the management of pointers
 Every data item (block, record, etc.) has two addresses:
  database address: the address on disk
  memory address, if the item is in virtual memory
Pointer Swizzling (Contd…)
102

Translation Table: Maps database address to
memory address
[Figure: the translation table has columns Dbaddr and Mem-addr; each database address maps to a memory address]

All addressable items in the database have entries
in the map table, while only those items currently in
memory are mentioned in the translation table
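A minimal sketch of the translation table in Python; the string addresses and the dict-based "memory" are assumptions for illustration, not a real buffer manager:

```python
translation = {}   # database address -> in-memory object

def load(dbaddr, block):
    """Bring a block into memory and record it in the translation table."""
    translation[dbaddr] = block

def follow(pointer):
    """Swizzle on demand: use the memory copy if loaded, else the db address."""
    if pointer in translation:
        return ("mem", translation[pointer])
    return ("db", pointer)

load("disk0/cyl7/block3", {"record": 42})
print(follow("disk0/cyl7/block3"))   # ('mem', {'record': 42})
print(follow("disk1/cyl2/block9"))   # ('db', 'disk1/cyl2/block9')
```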
Pointer Swizzling (Contd…)
103
 A pointer consists of the following two fields:
  a bit indicating the type of address
  the database or memory address itself

Example 13.17
[Figure: Block 1 on disk and its copy in memory; a swizzled pointer points into the memory copy of Block 1, while an unswizzled pointer still refers to Block 2 on disk]
Example 13.7
104
 Block 1 has a record with pointers to a second record on the same block and to a record on another block
 If Block 1 is copied to memory:
  The first pointer, which points within Block 1, can be swizzled so it points directly to the memory address of the target record
  Since Block 2 is not in memory, we cannot swizzle the second pointer
Pointer Swizzling (Contd…)
105
 Three types of swizzling:
  Automatic Swizzling
   As soon as a block is brought into memory, swizzle all relevant pointers.
  Swizzling on Demand
   Only swizzle a pointer if and when it is actually followed.
  No Swizzling
   Pointers are not swizzled; they are accessed using the database address.
Programmer Control of
Swizzling
106
 Unswizzling
  When a block is moved from memory back to disk, all pointers must go back to database (disk) addresses
  Use the translation table again
  Important to have an efficient data structure for the translation table
 Drawbacks
The possible time saved by swizzling all of a block‘s
pointers at one time must be weighed against the
possibility that some swizzled pointers will never be
followed.
In that case, any time spent swizzling and unswizzling the
pointer will be wasted.
Pinned records and Blocks
107
 A block in memory is said to be pinned if it cannot be written back to disk safely.
 A bit telling whether or not a block is pinned can be located in the header of the block.
 If block B1 has a swizzled pointer to an item in block B2, then B2 is pinned.
  To unpin a block, we must unswizzle any pointers to it:
  keep in the translation table the places in memory holding swizzled pointers to that item,
  and unswizzle those pointers (use the translation table to replace the memory addresses with database (disk) addresses).
108
13.7
CHAPTER 13
109
Records With Variable-Length
Fields
A simple but effective scheme is to put all fixed length
fields ahead of the variable-length fields. We then place
in the record header:
1. The length of the record.
2. Pointers to (i.e., offsets of) the beginnings of all the
variable-length fields. However, if the variable-length
fields always appear in the same order then the first of
them needs no pointer; we know it immediately follows
the fixed-length fields.
Example
110
[Fig 1: a movie-star record with four fields: name at offset 0, address at offset 30, gender at offset 286, birth date at offset 287; record length 297]
Records with Variable Fields
111
An effective way to represent variable-length records is as follows:
 Fixed-length fields are kept ahead of the variable-length fields
 The record header contains:
  • the length of the record
  • pointers to the beginning of all variable-length fields except the first one
112
Records with Variable Length
Fields
[Figure 2: a movie-star record with name and address implemented as variable-length character strings; the header holds other header information, the record length, and a pointer to the address field]
Records with Repeating Fields
113
 A record contains a variable number of occurrences of a field F
 All occurrences of field F are grouped together, and the record header contains a pointer to the first occurrence of field F
 L bytes are devoted to one instance of field F
 Locating the occurrences of field F within the record:
  • add integer multiples of L (0, L, 2L, 3L, and so on) to the offset of the first occurrence of F
  • stop upon reaching the offset of the field following F
Records with Repeating Fields
114
[Figure 3: a record with a repeating group of references to movies; after the header (other header information, record length, pointer to address, pointer to the movie pointers) come the name, the address, and the pointers to movies]
Records with Repeating Fields
115
[Figure 4: storing variable-length fields separately from the record; the record holds header information, a pointer to the name and its length, a pointer to the address and its length, and a pointer to the movie references with the number of references; the name and address are stored outside the record]
116
Records with Repeating Fields
Advantage
 Keeping the record itself fixed-length allows records to be searched more efficiently, minimizes the overhead in block headers, and allows records to be moved within or among blocks with minimum effort.
Disadvantage
 Storing variable length components on another block
increases the number of disk I/O’s needed to examine all
components of a record.
117
Records with Repeating Fields
A compromise strategy is to allocate a fixed portion of the record for the repeating fields:
 If the number of repeating fields is less than the allocated space, then there will be some unused space
 If the number of repeating fields is greater than the allocated space, then the extra fields are stored in a different location, and a pointer to that location and a count of the additional occurrences are stored in the record
Variable Format Records
118
 Records that do not have fixed schema
 Variable format records are represented by sequence of
tagged fields
 Each of the tagged fields consist of information
• Attribute or field name
• Type of the field
• Length of the field
• Value of the field
 Why use tagged fields
• Information – Integration applications
• Records with a very flexible schema
Variable Format Records
119
[Fig 5: a record with tagged fields. First field: code N (name), type code S (string), length 14, value "Clint Eastwood"; second field: code R (restaurant owned), type code S (string), length 16, value "Hog's Breath Inn"]
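A sketch of reading such a tagged-field record; the exact byte layout (one-letter field code, one-letter type code, a two-digit length, then the value) is invented for illustration:

```python
def parse_tagged(data):
    """Parse a sequence of (code, type, length, value) tagged fields."""
    fields, i = {}, 0
    while i < len(data):
        code, ftype = data[i], data[i + 1]
        length = int(data[i + 2:i + 4])      # two-digit length
        fields[code] = (ftype, data[i + 4:i + 4 + length])
        i += 4 + length
    return fields

rec = "NS14Clint EastwoodRS16Hog's Breath Inn"
fields = parse_tagged(rec)
print(fields["N"])   # ('S', 'Clint Eastwood')
print(fields["R"])   # ('S', "Hog's Breath Inn")
```

Because each field carries its own name, type, and length, records with entirely different schemas can share one file.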
Records that do not fit in a block
120
 When the length of a record is greater than the block size, the record is divided and placed into two or more blocks
 The portion of the record in each block is referred to as a RECORD FRAGMENT
 A record with two or more fragments is called a SPANNED RECORD
 A record that does not cross a block boundary is called an UNSPANNED RECORD
Spanned Records
121
Spanned records require the following extra header information:
 • A bit indicating whether it is a fragment or not
 • A bit indicating whether it is the first or last fragment of a record
 • Pointers to the next or previous fragment of the same record
Records that do not fit in a block
122
[Figure 6: storing spanned records across blocks; block 1 holds record 1 and fragment 2-a, block 2 holds fragment 2-b and record 3, each block beginning with a block header and each record with a record header]
BLOBS
123
• Binary Large OBjectS = BLOBS
• BLOBS can be images, movies, audio files, and other very large values that can be stored in files.
• Storing BLOBS
  – Stored in several blocks.
  – Preferable to store them consecutively on a cylinder or on multiple disks for efficient retrieval.
• Retrieving BLOBS
  – A client retrieving a 2-hour movie may not want it all at the same time.
  – Retrieving a specific part of the large data requires an index structure to make it efficient. (Example: an index by seconds on a movie BLOB.)
Record Modifications
Section 13.8
124
125
• A record is a single, implicitly structured data item in a database table. A record is also called a tuple.
• Record modification: records are modified when a data-manipulation operation is performed.
STRUCTURE OF A RECORD
126
•
RECORD STRUCTURE FOR A PERSON
TABLE
•
CREATE TABLE PERSON ( NAME CHAR(30), ADDRESS CHAR(256) ,
GENDER CHAR(1), BIRTHDATE CHAR(10));
TYPES OF RECORDS
127
• FIXED-LENGTH RECORDS
CREATE TABLE SJSUSTUDENT(STUDENT_ID INT(9) NOT NULL,
PHONE_NO INT(10) NOT NULL);
• VARIABLE-LENGTH RECORDS
CREATE TABLE SJSUSTUDENT(STUDENT_ID INT(9) NOT NULL,
NAME VARCHAR(100), ADDRESS VARCHAR(100), PHONE_NO INT(10) NOT NULL);
Modification types
128

Insertion

Deletion

Update
Insertion
 Insertion of records without order
  Records can be placed in a block with empty space or in a new block.
 Insertion of records in a fixed order
  Space available in the block
  No space available in the block (outside the block)
 Structured address
  A pointer to a record from outside the block.
129
Inserting New Records
130
• If records are not required to be in a particular order, just find an empty block and place the record in the block (e.g., heap files).
• What if the records are to be kept in a particular order (e.g., sorted by primary key)?
• Then locate the appropriate block and check if space is available in the block; if yes, place the record in the block.
Insertion in fixed order
Space available within the block

Use of an offset table in the header of each block with pointers to the
location of each record in the block.

The records are slid within the block and the pointers in the offset table are
adjusted.
[Figure: a block with an offset table in the header pointing to Records 1 through 4, with unused space]
131
Insertion in fixed order
No space available within the block (outside the block)
 Find space on a "nearby" block.
  • In case no space is available on a block, look at the following block in the sorted order of blocks.
  • If space is available in that block, move the highest records of block 1 to block 2 and slide the records around on both blocks.
 Create an overflow block
  • Records can be stored in the overflow block.
  • Each block has a place for a pointer to an overflow block in its header.
  • The overflow block can point to a second overflow block, as shown below.
[Figure: block B with a pointer in its header to the overflow block for B]
132
Deletion

133
Recover space after deletion

When using an offset table, the records can be slid around the block so
there will be an unused region in the center that can be recovered.

In case we cannot slide records, an available space list can be
maintained in the block header.

The list head goes in the block header and available regions hold the
links in the list.
Deletion
134
 Use of tombstones
  A tombstone is a bit placed at the first byte of a deleted record to indicate the record was deleted (0 = not deleted, 1 = deleted).
  The tombstone is placed in a record in order to keep pointers to the deleted record from pointing to new records.
  The tombstone is permanent until the entire database is reconstructed.
  Where a tombstone is placed depends on the nature of the record pointers.
  If pointers go to fixed locations from which the location of the record is found, then we put the tombstone in that fixed location. (See examples)
  A map table is used to translate logical record addresses to physical addresses.
Deletion
135
 Use of tombstones (cont'd)
  If we need to replace records by tombstones, place the bit that serves as the tombstone at the beginning of the record.
  This bit remains at the record location, and the subsequent bytes can be reused for another record.
[Figure: Record 1 (preceded by its tombstone bit) and Record 2]
Record 1 can be replaced, but the tombstone remains; Record 2 has no tombstone and can be seen when we follow a pointer to it.
Update
 Fixed-length update
  No effect on the storage system, as the record occupies the same space as before the update.
 Variable-length update
  • If the length increases, "slide the records," as for insertion
  • If the length decreases, as for deletion: update the space-available list, recover the space, and eliminate overflow blocks where possible
  Longer length
  Shorter length
136
Update
Variable-length update (longer length)
 Stored on the same block:
  slide the records
  create an overflow block
 Stored on another block:
  move records around that block
  create a new block for storing variable-length fields
137
Update
Variable-length update (shorter length)
 Same as deletion:
  recover space
  consolidate space
138
Topics
139
 Records with Variable-Length Fields
 Records with Repeating Fields
 Variable Format Records
 Records that do not fit in a block
 BLOBS
BTREES & BITMAP INDEXES
CHAPTER 14 (Sections 14.2 and 14.7)
DATABASE SYSTEMS – The Complete Book
141
Definition
142
A bitmap index for a field F is a collection of
bit-vectors of length n, one for each possible
value that may appear in that field F.[1]
What does that mean?
143
 Assume relation R with 2 attributes A and B.
 Attribute A is of type Integer and B is of type String.
 6 records, numbered 1 through 6, as shown:

     A    B
1)   30   foo
2)   30   bar
3)   40   baz
4)   50   foo
5)   40   bar
6)   30   baz
Example Continued…
144
 A bitmap for attribute B is:

Value   Vector
foo     100100
bar     010010
baz     001001

     A    B
1)   30   foo
2)   30   bar
3)   40   baz
4)   50   foo
5)   40   bar
6)   30   baz
Where do we reach?
145
 A bitmap index is a special kind of database index that uses bitmaps.[2]
 Bitmap indexes are considered to work well for data such as gender, which has a small number of distinct values (e.g., male and female) but many occurrences of those values.[2]
A little more…
146
 A bitmap index for attribute A of relation R is:
  a collection of bit-vectors
  The number of bit-vectors = the number of distinct values of A in R.
  The length of each bit-vector = the cardinality of R.
  The bit-vector for value v has 1 in position i if the ith record has v in attribute A, and 0 there if not.[3]
 Records are allocated permanent numbers.[3]
 There is a mapping between record numbers and record addresses.[3]
Motivation for Bitmap Indexes
147
 Very efficient when used for partial-match queries.[3]
 They offer the advantage of buckets,[2]
  where we find tuples with several specified attributes without first retrieving all the records that matched on each of the attributes.
 They can also help answer range queries.[3]
Another Example
148
Multidimensional Array of multiple types
{(5,d),(79,t),(4,d),(79,d),(5,t),(6,a)}
5 = 100010
79 = 010100
4 = 001000
6 = 000001
d = 101100
t = 010010
a = 000001
Example Continued…
149
{(5,d),(79,t),(4,d),(79,d),(5,t),(6,a)}
Searching for items is easy, just AND together.
To search for (5,d)
5 = 100010
d = 101100
100010 AND 101100 = 100000
The location of the
record has been traced!
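The same AND-based search, sketched with Python integers as bit-vectors (record 1 is the leftmost bit, matching the vectors above):

```python
records = [(5, "d"), (79, "t"), (4, "d"), (79, "d"), (5, "t"), (6, "a")]
n = len(records)

def bitvector(column, value):
    """Bit-vector for a value: bit i is 1 iff record i has that value."""
    v = 0
    for i, rec in enumerate(records):
        if rec[column] == value:
            v |= 1 << (n - 1 - i)     # record 1 -> leftmost bit
    return v

hit = bitvector(0, 5) & bitvector(1, "d")   # search for (5, d)
print(format(hit, "06b"))   # 100000 -> record 1
```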
Compressed Bitmaps
150
 Assume:
  the number of records in R is n
  attribute A has m distinct values in R
 The size of a bitmap index on attribute A is m*n bits.
 If m is large, then the fraction of 1's will be around 1/m.
  An opportunity to encode
 A common encoding approach is called run-length encoding.[1]
Run-length encoding
151
 Represents runs
  A run is a sequence of i 0's followed by a 1, represented by some suitable binary encoding of the integer i.
 A run of i 0's followed by a 1 is encoded by:
  first computing how many bits are needed to represent i, say k
  then representing the run by k-1 1's and a single 0, followed by k bits which represent i in binary.
  The encoding for i = 1 is 01 (k = 1).
  The encoding for i = 0 is 00 (k = 1).
 We concatenate the codes for each run together, and the resulting sequence of bits is the encoding of the entire bit-vector.
Understanding with an Example
152
 Let us decode the sequence 11101101001011
 Starting at the beginning (leftmost bit):
  First run: the first 0 is at position 4, so k = 4. The next 4 bits are 1101, so the first integer is i = 13
  Second run: 001011 begins with a 0, so k = 1; the next bit is 0, so i = 0
  Last run: 1011 begins with 10, so k = 2; the next 2 bits are 11, so i = 3
 The runs are thus 13, 0, 3; hence our bit-vector is:
0000000000000110001
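The scheme can be sketched over '0'/'1' strings; the helper names here are my own:

```python
def encode_run(i):
    """k-1 ones, a zero, then i in k bits, where k = bits needed for i."""
    k = max(i.bit_length(), 1)            # i = 0 still takes one bit
    return "1" * (k - 1) + "0" + format(i, "0{}b".format(k))

def encode(bitvector):
    runs = [len(r) for r in bitvector.split("1")[:-1]]  # zeros before each 1
    return "".join(encode_run(i) for i in runs)

def decode(code):
    out, pos = [], 0
    while pos < len(code):
        k = 1
        while code[pos] == "1":           # count the k-1 leading ones
            k += 1
            pos += 1
        pos += 1                          # skip the separating zero
        out.append("0" * int(code[pos:pos + k], 2) + "1")
        pos += k
    return "".join(out)

print(decode("11101101001011"))   # 0000000000000110001
```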
Managing Bitmap Indexes
153
1) How to find a specific bit-vector for a value
efficiently?
2) After selecting results that match, how to
retrieve the results efficiently?
3) When data is changed, do you alter bitmap
index?
1) Finding bit vectors
154
 Think of each bit-vector as a key to a value.[1]
 Any secondary storage technique will be efficient in retrieving the values.[1]
 Create a secondary index with the attribute value as the search key [3]
  B-tree
  Hash
2) Finding Records
155
 Create a secondary index with the record number as the search key [3]
 Or in other words, for finding record k, create a secondary index using the kth position as a search key.[1]
3) Handling Modifications
156
Two things to remember:
Record numbers cannot be changed once assigned
Changes to data file require changes to bitmap index
157
Deletion
 Tombstone replaces deleted record
 Corresponding bit is set to 0
158
Insertion
 Record is assigned next record number.
 A bit of value 0 or 1 is appended to each bit
vector
 If new record contains a new value of the
attribute, add one bit-vector.
159
Modification
 Change the bit corresponding to the old value
of the modified record to 0
 Change the bit corresponding to the new
value of the modified record to 1
 If the new value is a new value of A, then
insert a new bit-vector.
160
Section 14.2
Chapter 14
Structure
161
 A B-tree organizes its blocks into a balanced tree, meaning that all paths from the root to a leaf have the same length. Typically there are three layers (the root, an intermediate layer, and leaves), but any number of layers is possible.
 There is a parameter n associated with each B-tree block: each block has space for n search-keys and n+1 pointers.
 The root may have as few as 1 key, but all other blocks must be at least half full.
Structure
162
• There are three layers in a B-tree: the root, an intermediate layer, and leaves
• In a B-tree each block has space for n search-key values and n+1 pointers
[Figure: a typical node holds n search keys and n+1 pointers; a typical interior node's pointers point to nodes at the next level, while a typical leaf's pointers point to records]
Application
163
 The search key of the B-tree is the primary key for the data file; that is, there is one key-pointer pair in a leaf for every record of the data file.
 The data file may be sorted by its primary key.
 Or the data file may be sorted by an attribute that is not a key, and this attribute is the search key for the B-tree.
Lookup
164
If at an interior node, choose the correct pointer to use. This is
done by comparing keys to search value.
Lookup
165
At a leaf node, choose the key that matches what
you are looking for and the pointer for that leads
to the data.
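The comparison step at a single node can be sketched with Python's bisect module; keys and children as plain lists are an assumption for illustration, not a disk layout:

```python
import bisect

def choose_child(keys, search_key):
    """Index of the pointer to follow at an interior node: pointer i
    leads to keys K with keys[i-1] <= K < keys[i]."""
    return bisect.bisect_right(keys, search_key)

# Interior node with keys [13, 31]: children for <13, [13, 31), >=31.
print(choose_child([13, 31], 7),
      choose_child([13, 31], 13),
      choose_child([13, 31], 40))   # 0 1 2
```

Repeating this choice at each level walks from the root to the one leaf that can contain the search key.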
Insertion
166
 For insertion, choose the correct leaf node in which to put the pointer to the data.
 If the node is full, create a new node and split the keys between the two.
 Then recursively move up: if we cannot add a pointer to the new node because the parent is full, split the parent as well.
 This could end with creating a new root node, if the current root was full.
Deletion
167
Perform a lookup to find the node to delete from, and delete the key.
If the node is no longer half full, either join it with an adjacent node and recursively delete up, or move a key over from an adjacent node (if that node has keys to spare) and adjust the separating key in the parent.
Efficiency
168
Btrees allow lookup, insertion, and deletion of
records using very few disk I/Os.
Each level of a Btree would require one read. Then
you would follow the pointer of that to the next or
final read.
Efficiency
169
Three levels are usually sufficient for B-trees: with 255 pointers per block, 255^3 is about 16.6 million.
You can reduce disk I/Os further by keeping a level of the B-tree in main memory. Keeping the root block with its 255 pointers reduces the reads to 2, and it is even possible to keep the next level's 255 blocks in memory to reduce the reads to 1.
Query Execution
170
Chapter 15
Section 15.1
Agenda
171
 Query Processor and major parts of the Query Processor
 Physical-Query-Plan Operators
 Scanning Tables
 Basic approaches to locate the tuples of a relation R
 Sorting While Scanning Tables
 Computation Model for Physical Operators
 I/O Cost for Scan Operators
 Iterators
What is a Query Processor
172
 The group of components of a DBMS that converts user queries and data-modification commands into a sequence of database operations
 It also executes those operations
 It must supply detail regarding how the query is to be executed
Major parts of Query processor
173
Query Execution:
The algorithms
that manipulate
the data of the
database.
Focus on the
operations of
extended
relational
algebra.
Outline of Query Compilation
174
Query compilation
 Parsing : A parse tree for the
query is constructed
 Query Rewrite : The parse
tree is converted to an initial
query plan and transformed
into logical query plan (less
time)
 Physical Plan Generation: the logical query plan is converted into a physical query plan by selecting algorithms and an order of execution for the operators.
Physical-Query-Plan Operators
175
 Physical operators are implementations of the operators of relational algebra.
 They are also used for operations outside relational algebra, like "scan", which scans tables; that is, it brings each tuple of some relation into main memory.
Scanning Tables
176
 One of the basic things we can do in a physical query plan is to read the entire contents of a relation R.
 A variation of this operator involves a simple predicate: read only those tuples of relation R that satisfy the predicate.
Scanning Tables
177
Basic approaches to locate the tuples of a relation R:
 Table Scan
  Relation R is stored in secondary memory with its tuples arranged in blocks
  It is possible to get the blocks one by one
 Index-Scan
  If there is an index on any attribute of relation R, we can use this index to get all the tuples of R
Sorting While Scanning Tables
178
 There are a number of reasons to sort a relation:
  The query could include an ORDER BY clause, requiring that a relation be sorted.
  Algorithms that implement relational-algebra operations may require one or both arguments to be sorted relations.
  The physical-query-plan operator sort-scan takes a relation R and the attributes on which the sort is to be made, and produces R in that sorted order
179
Computation Model for Physical Operators
 Physical-plan operators should be selected wisely; this is essential for a good query processor.
 The "cost" of each operator is estimated by the number of disk I/O's for the operation.
 The total cost of an operation depends on the size of the answer and includes the final write-back cost in the total cost of the query.
Parameters for Measuring Costs
180
 Parameters that affect the performance of a query:
  buffer-space availability in main memory at the time of execution of the query
  size of the input and size of the output generated
  the size of a block on disk and the available space in main memory also affect performance
Parameters for Measuring Costs
181
 B: the number of blocks needed to hold all the tuples of relation R. Also denoted B(R).
 T: the number of tuples in relation R. Also denoted T(R).
 V: the number of distinct values that appear in a column of relation R.
  V(R, a) is the number of distinct values of column a in relation R.
I/O Cost for Scan Operators
182
 If relation R is clustered, then the number of disk I/O's for the table-scan operator is approximately B.
 If relation R is not clustered, then the number of required disk I/O's is generally much higher.
 An index on a relation R occupies many fewer than B(R) blocks.
  That means a scan of the entire relation R, which takes at least B disk I/O's, will require more I/O's than a scan of the entire index.
Iterators for Implementation of
Physical Operators
183
 Many physical operators can be implemented as an iterator.
 Three methods form the iterator for an operation:
 1. Open():
  This method starts the process of getting tuples
  It initializes any data structures needed to perform the operation
184
Iterators for Implementation of
Physical Operators
 2. GetNext():
  Returns the next tuple in the result
  If there are no more tuples to return, GetNext returns a special value NotFound
 3. Close():
  Ends the iteration after all tuples have been obtained
  It calls Close on any arguments of the operator
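The three methods can be sketched as a Python class implementing a table scan over an in-memory list, with NotFound modeled as None:

```python
class TableScan:
    """Iterator (Open/GetNext/Close) over a list standing in for a relation."""
    def __init__(self, tuples):
        self.tuples = tuples
        self.pos = None

    def open(self):
        self.pos = 0                      # initialize scan state

    def get_next(self):
        if self.pos >= len(self.tuples):
            return None                   # NotFound
        t = self.tuples[self.pos]
        self.pos += 1
        return t

    def close(self):
        self.pos = None                   # release scan state

scan = TableScan([(1, "a"), (2, "b")])
scan.open()
out = []
while (t := scan.get_next()) is not None:
    out.append(t)
scan.close()
print(out)   # [(1, 'a'), (2, 'b')]
```

Because each GetNext produces one tuple on demand, operators built this way can be pipelined without materializing intermediate results.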
Query Execution
185
One-Pass Algorithms for
Database Operations (15.2)
One-Pass Algorithm Methods
186
 Tuple-at-a-time, unary operations (selection & projection)
 Full-relation, unary operations
 Full-relation, binary operations (set & bag versions of union, etc.)
One-Pass Algorithms for Tuple-at-a-Time Operations
187
 The tuple-at-a-time operations are selection and projection:
  read the blocks of R one at a time into an input buffer
  perform the operation on each tuple
  move the selected tuples or the projected tuples to the output buffer
 The disk I/O requirement for this process depends only on how the argument relation R is provided:
  if R is initially on disk, then the cost is whatever it takes to perform a table-scan or index-scan of R.
188
A selection or projection being
performed on a relation R
189
One-Pass Algorithms for Unary,
full Relation Operations
 Duplicate Elimination
  To eliminate duplicates, we can read each block of R one at a time, but for each tuple we need to decide whether:
  1. it is the first time we have seen this tuple, in which case we copy it to the output, or
  2. we have seen the tuple before, in which case we must not output this tuple.
  One memory buffer holds one block of R's tuples, and the remaining M - 1 buffers can be used to hold a single copy of every tuple.
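A sketch of that decision in Python, with a set standing in for the M - 1 buffers that hold one copy of each tuple seen:

```python
def delta(tuples):
    """One-pass duplicate elimination over an iterable of tuples."""
    seen, out = set(), []
    for t in tuples:
        if t not in seen:        # first time seen: copy to the output
            seen.add(t)
            out.append(t)
    return out                   # later duplicates are not output

print(delta([(30, "foo"), (30, "foo"), (40, "baz")]))
# [(30, 'foo'), (40, 'baz')]
```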
190
Managing memory for a one-pass duplicate-elimination
Duplicate Elimination
191
 When a new tuple from R is considered, we compare it with all tuples seen so far:
  if it is not equal to any of them, we copy it to the output and add it to the in-memory list of tuples we have seen.
  if there are n tuples in main memory, each new tuple takes processor time proportional to n, so the complete operation takes processor time proportional to n^2.
 We need a main-memory structure that allows each of the operations:
  add a new tuple, and
  tell whether a given tuple is already there
Duplicate Elimination (…contd.)
192
 The different structures that can be used for such a main-memory structure are:
  hash table
  balanced binary search tree
 Each of these structures has some space overhead in addition to the space needed to store the tuples; for example, a main-memory hash table needs a bucket array and space for pointers to link the tuples in a bucket.
193
One-Pass Algorithms for Unary,
full Relation Operations

Grouping

The grouping operation gives us zero or more grouping
attributes and presumably one or more aggregated
attributes

If we create in main memory one entry for each group then
we can scan the tuples of R, one block at a time.

The entry for a group consists of values for the grouping
attributes and an accumulated value or values for each
aggregation.
Grouping
194
 The accumulated value is:
  For MIN(a) or MAX(a), record the minimum/maximum a-value seen so far, respectively.
  For any COUNT aggregation, add 1 for each tuple of the group.
  For SUM(a), add the value of attribute a to the accumulated sum for its group, provided a is not NULL.
  AVG(a) is the hard case: we must maintain two accumulations, the count of the number of tuples in the group and the sum of the a-values of these tuples (the same values computed for COUNT and SUM). After considering all tuples of R, take the quotient of the sum and the count to obtain the average.
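A sketch of one-pass AVG(a) following that rule, keeping (count, sum) per group and taking the quotient at the end; representing tuples as (group-key, a) pairs is an assumption for illustration:

```python
def group_avg(tuples):
    """One-pass grouping computing AVG(a) per group key."""
    acc = {}                              # group key -> (count, sum)
    for key, a in tuples:
        if a is None:                     # NULLs do not contribute
            continue
        cnt, s = acc.get(key, (0, 0))
        acc[key] = (cnt + 1, s + a)
    return {k: s / cnt for k, (cnt, s) in acc.items()}

print(group_avg([("x", 10), ("x", 20), ("y", 5), ("y", None)]))
# {'x': 15.0, 'y': 5.0}
```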
195
One-Pass Algorithms for Binary Operations
 What is a binary operation? One that uses two operands for evaluation:
  Union
  Intersection
  Difference
  Product
  Join
 What is the difference between a set and a bag?
  A set does not allow duplicate elements
  A bag allows duplicate elements
Set Union
196

We read S into M - 1 buffers of main memory and
build a search structure where the search key is the
entire tuple.

All these tuples are also copied to the output.

Read each block of R into the Mth buffer, one at a
time.

For each tuple t of R, see if t is in S, and if not, we
copy t to the output. If t is also in S, we skip t.
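These steps, sketched in Python with a set standing in for the M - 1 buffers:

```python
def one_pass_set_union(R, S):
    """Set union with S held in memory and R streamed past it."""
    in_memory = set(S)               # read S into the M - 1 buffers
    out = list(in_memory)            # copy all of S to the output
    for t in R:                      # read R one tuple at a time
        if t not in in_memory:
            out.append(t)            # t appears only in R
    return out

u = one_pass_set_union([1, 2, 3], [3, 4])
print(sorted(u))   # [1, 2, 3, 4]
```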
Set Intersection
197

Read S into M - 1 buffers and build a search structure
with full tuples as the search key.

Read each block of R, and for each tuple t of R, see if
t is also in S. If so, copy t to the output, and if not,
ignore t.
Set Difference
198
 Read S into M - 1 buffers and build a search structure with full tuples as the search key.
 To compute R -s S, read each block of R and examine each tuple t on that block. If t is in S, then ignore t; if it is not in S, then copy t to the output.
 To compute S -s R, read the blocks of R and examine each tuple t in turn. If t is in S, then delete t from the copy of S in main memory, while if t is not in S, do nothing.
 After considering each tuple of R, copy to the output those tuples of S that remain.
Bag Intersection
199
 Read S into M - 1 buffers.
 Multiple copies of a tuple t are not stored individually.
Rather, store one copy of t and associate with it a count
equal to the number of times t occurs.
 Next, read each block of R, and for each tuple t of R
see whether t occurs in S. If not, ignore t; it cannot
appear in the intersection. If t appears in S, and the
count associated with t is positive, then output t and
decrement the count by 1. If t appears in S, but the
count has reached 0, then do not output t; we have
already produced as many copies of t in the output as
there were copies in S.
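The count-and-decrement scheme maps directly to a `Counter`; the following is an illustrative sketch, not the slides' own code:

```python
from collections import Counter

def one_pass_bag_intersection(r_blocks, s_tuples):
    """One-pass bag intersection: store one copy of each tuple of S with
    an occurrence count, then stream R and decrement the counts."""
    counts = Counter(s_tuples)   # one copy of t plus its count
    out = []
    for block in r_blocks:
        for t in block:
            if counts[t] > 0:    # t still has unmatched copies in S
                out.append(t)
                counts[t] -= 1   # one fewer copy of t left to emit
    return out
```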
Bag Difference
200

To compute S -B R, read tuples of S into main
memory & count no. of occurrences of each distinct
tuple.

Then read R; check each tuple t to see whether t
occurs in S, and if so, decrement its associated
count. At the end, copy to output each tuple in main
memory whose count is positive, & no. of times we
copy it equals that count.

To compute R -B S, read tuples of S into main
memory & count no. of occurrences of distinct tuples.
Bag Difference (…contd.)
201

Think of a tuple t with a count of c as c reasons not to
copy t to the output as we read tuples of R.

Read a tuple t of R; check if t occurs in S. If not, then
copy t to the output. If t does occur in S, then we look at
current count c associated with t. If c = 0, then copy t to
output. If c > 0, do not copy t to output, but decrement c
by 1.
Product
202

Read S into M - 1 buffers of main memory

Then read each block of R, and for each tuple t of R
concatenate t with each tuple of S in main memory.

Output each concatenated tuple as it is formed.

This algorithm may take a considerable amount of
processor time per tuple of R, because each such
tuple must be matched with M - 1 blocks full of tuples.
However, output size is also large, & time/output tuple
is small.
Natural Join
203

Convention: R(X, Y) is being joined with S(Y, Z), where
Y represents all the attributes that R and S have in
common, X is all attributes of R that are not in the
schema of S, & Z is all attributes of S that are not in the
schema of R. Assume that S is the smaller relation.
 To compute the natural join, do the following:
1. Read all tuples of S and form them into a main-memory
search structure. A hash table or balanced tree is a good
example of such a structure. Use M - 1 blocks of memory
for this purpose.
Natural Join (…contd.)
204
2. Read each block of R into the one remaining main-memory
buffer.
 For each tuple t of R, find the tuples of S that agree
with t on all attributes of Y, using the search structure.
 For each matching tuple of S, form a tuple by joining it
with t, and move the resulting tuple to the output.
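The two steps above can be sketched as follows; this is an illustrative model where tuples are dicts, blocks are lists, and a Python dict plays the role of the hash-table search structure:

```python
from collections import defaultdict

def one_pass_natural_join(r_blocks, s_tuples, y):
    """One-pass natural join of R(X,Y) with the smaller relation S(Y,Z).
    Tuples are dicts; y is the list of common attribute names."""
    # Step 1: build a main-memory search structure on the Y-attributes of S.
    index = defaultdict(list)
    for s in s_tuples:
        index[tuple(s[a] for a in y)].append(s)
    # Step 2: read each block of R and probe the structure.
    out = []
    for block in r_blocks:
        for t in block:
            for s in index.get(tuple(t[a] for a in y), []):
                out.append({**t, **s})  # join t with each matching S-tuple
    return out
```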
205
15.3
Nested-Loop Joins
By:
Saloni Tamotia (215)
Introduction to Nested-Loop
Joins
206
 Can be used for relations of any size; it is not
necessary that a relation fit in main memory.
 Uses a “one-and-a-half” pass method: in each variation,
 one argument is read just once;
 the other argument is read repeatedly.
 Two kinds:
 Tuple-Based Nested-Loop Join
 Block-Based Nested-Loop Join
207
ADVANTAGES OF NESTED-LOOP
JOIN
 Fits in the iterator framework.
 Allows us to avoid storing the intermediate
relation on disk.
Tuple-Based Nested-Loop Join
208
 Simplest variation of the
nested-loop join
 Loop ranges over individual
tuples
Tuple-Based Nested-Loop Join
209
 Algorithm to compute the join R(X,Y) ⋈ S(Y,Z):
FOR each tuple s in S DO
  FOR each tuple r in R DO
    IF r and s join to make tuple t THEN
      output t
 R and S are two relations with r and s as tuples.
 Carelessness in buffering of blocks causes the use
of T(R)T(S) disk I/O’s.
IMPROVEMENT &
MODIFICATION
210
To decrease the cost:
 Method 1: Use an algorithm for index-based
joins
 We find the tuples of R that match a given tuple of S.
 We need not read the entire relation R.
 Method 2: Use an algorithm for block-based
joins
Block-Based Nested-Loop Join
Algorithm
211
 Access to the arguments is organized by block.
 While reading tuples of the inner relation, we
use fewer disk I/O’s.
 Use as much main memory as we can to store
tuples of the relation of the outer loop.
 This allows us to join each tuple of the inner
relation with as many tuples as possible.
212
Block-Based Nested-Loop Join
Algorithm
ALGORITHM:
FOR each chunk of M-1 blocks of S DO
  Read these blocks into main-memory buffers;
  FOR each block b of R DO
    Read b into main memory;
    FOR each tuple t of b DO
      Find the tuples of S in main memory that join with t;
      Output the join of t with each of these tuples;
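The loop structure above can be modeled directly; this is an illustrative sketch in which blocks are lists of tuples and `join_key` extracts the common Y-value (both names are our own, not from the slides):

```python
def block_nested_loop_join(s_blocks, r_blocks, m, join_key):
    """Block-based nested-loop join: load S in chunks of M-1 blocks,
    index each chunk on the join key, and scan all of R once per chunk."""
    out = []
    chunk_size = m - 1
    for i in range(0, len(s_blocks), chunk_size):
        chunk = {}                       # search structure for this chunk
        for block in s_blocks[i:i + chunk_size]:
            for s in block:
                chunk.setdefault(join_key(s), []).append(s)
        for block in r_blocks:           # read each block b of R
            for t in block:              # for each tuple t of b
                for s in chunk.get(join_key(t), []):
                    out.append((t, s))   # output the join of t with s
    return out
```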
213
Block-Based Nested-Loop Join
Algorithm
 Assumptions:
 B(S) ≤ B(R)
 B(S) > M
This means that neither relation fits entirely
in main memory.
Analysis of Nested-Loop Join
214
 Number of disk I/O’s:
⌈B(S)/(M-1)⌉ * (M-1 + B(R))
or
B(S) + B(S)B(R)/(M-1)
or approximately B(S)*B(R)/M
 Assuming all of M, B(S), B(R) are large, but
M is the smallest of these, the cost is
approximately B(S)*B(R)/M,
i.e. proportional to the product of the sizes
of the two relations, divided by the amount
of available memory.
215
TWO-PASS ALGORITHMS
BASED ON SORTING
SECTION 15.4
Two-Pass Algorithms Based on
Sorting
216

Two-pass Algorithms: where data from the
operand relations is read into main memory,
processed in some way, written out to disk again,
and then reread from disk to complete the
operation
Basic idea
217
 Step 1: Read M blocks of R into main memory.
 Step 2: Sort these M blocks in main memory, using
an efficient main-memory sorting algorithm. We
expect that the time to sort will not exceed the
disk I/O time for step (1).
 Step 3: Write the sorted list to M blocks of disk.
Duplicate Elimination Using
Sorting δ(R)
218
 First sort the tuples of R into sublists.
 Then use the available main memory to hold one
block from each sorted sublist.
 Then repeatedly copy the first remaining tuple to the
output and ignore all tuples identical to it.
 The total cost of this algorithm is 3B(R) disk I/O’s.
 This algorithm requires only √B(R) blocks of main
memory, rather than the B(R) blocks of the one-pass
algorithm.
Example
219
 Suppose that tuples are integers, and only two
tuples fit on a block. Also, M = 3 and the relation R
consists of 17 tuples:
2,5,2,1,2,2,4,5,4,3,4,2,1,5,2,1,3
 After the first pass:

Sublist   Elements
R1        1,2,2,2,2,5
R2        2,3,4,4,4,5
R3        1,1,2,3,5
Example
220
 Second pass:

Sublist   In memory   Waiting on disk
R1        1,2         2,2, 2,5
R2        2,3         4,4, 4,5
R3        1,1         2,3,5

After processing tuple 1 (Output: 1):

Sublist   In memory   Waiting on disk
R1        2           2,2, 2,5
R2        2,3         4,4, 4,5
R3        2,3         5

Continue the same process with the next tuple.
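The two passes can be sketched compactly; this is an illustrative model (using `heapq.merge` to stand in for the M-way merge of sublist buffers), not code from the slides:

```python
import heapq

def two_pass_distinct(tuples, m, block_size):
    """Duplicate elimination by sorting: pass 1 sorts runs of M blocks
    into sublists; pass 2 merges the sublists, copying each distinct
    tuple once and ignoring identical ones."""
    run = m * block_size   # number of tuples per sorted sublist
    sublists = [sorted(tuples[i:i + run])
                for i in range(0, len(tuples), run)]
    out = []
    for t in heapq.merge(*sublists):  # smallest remaining tuple first
        if not out or out[-1] != t:   # skip tuples identical to the last
            out.append(t)
    return out
```

With the 17 tuples of the example, M = 3, and two tuples per block, the sorted sublists match R1, R2, R3 above, and the output is 1, 2, 3, 4, 5.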
Grouping and Aggregation Using
Sorting γ(R)
221
 The two-pass algorithm for grouping and aggregation is quite
similar to the previous algorithm.
 Step 1: Read the tuples of R into memory, M blocks at a
time. Sort each M blocks, using the grouping attributes of L
as the sort key. Write each sorted sublist to disk.
 Step 2: Use one main-memory buffer for each sublist, and
initially load the first block of each sublist into its buffer.
 Step 3: Repeatedly find the least value of the sort key
(grouping attributes) present among the first available
tuples in the buffers.
 This algorithm takes 3B(R) disk I/O’s, and will work as long
as B(R) ≤ M².
A Sort-Based Union Algorithm
222
 For bag union, the one-pass algorithm is used.
 For set union:
◦ Step 1: Repeatedly bring M blocks of R into main memory, sort their
tuples, and write the resulting sorted sublist back to disk.
◦ Step 2: Do the same for S, to create sorted sublists for relation S.
◦ Step 3: Use one main-memory buffer for each sublist of R and S.
Initialize each with the first block from the corresponding sublist.
◦ Step 4: Repeatedly find the first remaining tuple t among all the
buffers. Copy t to the output, and remove from the buffers all copies
of t (if R and S are sets there should be at most two copies).
 This algorithm takes 3(B(R)+B(S)) disk I/O’s, and will work as long
as B(R)+B(S) ≤ M².
Sort-Based Intersection and
Difference
223
 For both the set and bag versions, the algorithm is the
same as that of set union, except in the way we handle
the copies of a tuple t at the fronts of the sorted sublists:
 For set intersection, output t if it appears in both R
and S.
 For bag intersection, output t the minimum of the
number of times it appears in R and in S.
 For set difference, R − S, output t if and only if it
appears in R but not in S.
 For bag difference, R − S, output t the number of times it
appears in R minus the number of times it appears in S.
A Simple Sort-Based Join Algorithm
224
 When taking a join, the number of tuples from the
two relations that share a common value of the
join attribute(s), and hence must be in main memory
simultaneously, can exceed what fits in memory.
 To avoid this situation, we try to reduce main-memory
use for other aspects of the algorithm, making
available a large number of buffers to hold the
tuples with a given join-attribute value.
A Simple Sort-Based Join Algorithm
225
 Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main
memory for buffers:
 Step 1: Sort R and S, using a two-phase, multiway merge sort, with Y as
the sort key.
 Step 2: Merge the sorted R and S. The following steps are done
repeatedly:
◦ Find the least value y of the join attributes Y that is currently at the front of the
blocks for R and S.
◦ If y does not appear at the front of the other relation, then remove the
tuple(s) with sort key y.
◦ Otherwise, identify all the tuples from both relations having sort key y.
◦ Output all the tuples that can be formed by joining tuples from R and S with the
common Y-value y.
◦ If either relation has no more unconsidered tuples in main memory, reload the
buffer for that relation.
A Simple Sort-Based Join Algorithm
226
 The simple sort-join uses 5(B(R) + B(S)) disk I/O’s.
 It requires B(R) ≤ M² and B(S) ≤ M² to work.
A More Efficient Sort-Based Join
227
 If we do not have to worry about very large numbers of
tuples with a common value for the join attribute(s), then
we can save two disk I/O’s per block by combining the
second phase of the sorts with the join itself.
 To compute R(X, Y) ►◄ S(Y, Z) using M main-memory
buffers:
 Create sorted sublists of size M, using Y as the sort key, for both
R and S.
 Bring the first block of each sublist into a buffer.
 Repeatedly find the least Y-value y among the first available
tuples of all the sublists. Identify all the tuples of both relations
that have Y-value y. Output the join of all tuples from R with all
tuples from S that share this common Y-value.
A More Efficient Sort-Based Join
228


The number of disk I/O’s is 3(B(R) + B(S))
It requires B(R) + B(S) ≤ M² to work
Summary of Sort-Based Algorithms
229
Operators            Approximate M required    Disk I/O
γ, δ                 √B(R)                     3B(R)
U, ∩, −              √(B(R) + B(S))            3(B(R) + B(S))
►◄ (simple)          √(max(B(R), B(S)))        5(B(R) + B(S))
►◄ (more efficient)  √(B(R) + B(S))            3(B(R) + B(S))
230
QUERY
EXECUTION
15.5 TWO-PASS ALGORITHMS
BASED ON HASHING
By
Swathi Vegesna
231
If the data is too big to store in main memory
buffers, hashing is done
 Hash
all the tuples of the argument(s) using an
appropriate hash key.
 For all the common operations, there is a way to
select the hash key so all the tuples that need to
be considered together when we perform the
operation have the same hash value.
 This reduces the size of the operand(s) by a
factor equal to the number of buckets.
Partitioning Relations by
Hashing
232
Algorithm:
initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO BEGIN
  read block b into the Mth buffer;
  FOR each tuple t in b DO BEGIN
    IF the buffer for bucket h(t) has no room for t THEN
    BEGIN
      copy the buffer to disk;
      initialize a new empty block in that buffer;
    END;
    copy t to the buffer for bucket h(t);
  END;
END;
FOR each bucket DO
  IF the buffer for this bucket is not empty THEN
    write the buffer to disk;
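The partitioning loop above can be sketched as follows; this is an illustrative model (not the slides' code) with a tiny fixed block size and lists standing in for buffers and disk blocks:

```python
def hash_partition(blocks, m, h):
    """Partition a relation into M-1 buckets, modeling one in-memory
    output buffer per bucket that is spilled to "disk" when full."""
    BLOCK_SIZE = 2                        # tuples per block (illustrative)
    buffers = [[] for _ in range(m - 1)]  # one buffer per bucket
    disk = [[] for _ in range(m - 1)]     # blocks written out per bucket
    for b in blocks:                      # read each block b of R
        for t in b:
            i = h(t) % (m - 1)
            if len(buffers[i]) == BLOCK_SIZE:  # no room for t: spill
                disk[i].append(buffers[i])
                buffers[i] = []           # start a new empty block
            buffers[i].append(t)
    for i in range(m - 1):                # flush non-empty buffers
        if buffers[i]:
            disk[i].append(buffers[i])
    return disk
```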
Duplicate Elimination
233
For the operation δ(R) hash R to M-1 Buckets.
(Note that two copies of the same tuple t will hash to the
same bucket)
 duplicate elimination can be done on each bucket Ri
independently, using one-pass algorithm
 The result is the union of δ(Ri), where Ri is the portion of
R that hashes to the ith bucket

Requirements
234
 Number of disk I/O’s: 3B(R)
 The two-pass, hash-based algorithm works only if
B(R) ≤ M(M-1), i.e., approximately B(R) ≤ M².
 In order for this to work, we need:
 the hash function h to distribute the tuples evenly
among the buckets
 each bucket Ri to fit in main memory (to allow the
one-pass algorithm)
Grouping and Aggregation
235
 Hash all the tuples of relation R to M-1 buckets, using a
hash function that depends only on the grouping
attributes.
 (Note: all tuples in the same group end up in the same
bucket.)
 Use the one-pass algorithm to process each bucket
independently.
 Uses 3B(R) disk I/O's; requires B(R) ≤ M².
Union, Intersection, and
Difference
236
 For a binary operation we use the same hash
function to hash the tuples of both arguments.
 For R U S, R ∩ S, and R − S, we hash both R and S
to M-1 buckets each, then apply the one-pass
algorithm to each pair of buckets Ri, Si.
 Requires 3(B(R)+B(S)) disk I/O’s.
 The two-pass, hash-based algorithm requires
min(B(R), B(S)) ≤ M².
Hash-Join Algorithm
237
 Use the same hash function for both relations; the hash
function should depend only on the join attributes.
 Hash R to M-1 buckets R1, R2, …, RM-1.
 Hash S to M-1 buckets S1, S2, …, SM-1.
 Do a one-pass join of Ri and Si, for all i.
 3(B(R) + B(S)) disk I/O's; requires min(B(R), B(S)) ≤ M².
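The hash-join steps can be sketched as follows; this is an illustrative in-memory model (bucket lists stand in for disk partitions, and `key` extracts the join attribute):

```python
def hash_join(r, s, key, m=4):
    """Two-pass hash join sketch: hash both relations on the join key
    into M-1 buckets with the same hash function, then run a one-pass
    join on each pair of buckets R_i, S_i."""
    n = m - 1
    r_buckets = [[] for _ in range(n)]
    s_buckets = [[] for _ in range(n)]
    for t in r:
        r_buckets[hash(key(t)) % n].append(t)
    for t in s:
        s_buckets[hash(key(t)) % n].append(t)
    out = []
    for ri, si in zip(r_buckets, s_buckets):  # join R_i with S_i only
        index = {}
        for t in si:                          # one-pass join of the pair
            index.setdefault(key(t), []).append(t)
        for t in ri:
            for u in index.get(key(t), []):
                out.append((t, u))
    return out
```

Because both relations use the same hash function, tuples that join can only meet in buckets with the same index, which is why joining Ri with Si alone suffices.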
Sort-Based vs. Hash-Based
238

For binary operations, hash-based only limits
size to min of arguments, not sum

Sort-based can produce output in sorted order,
which can be helpful

Hash-based depends on buckets being of
equal size

Sort-based algorithms can experience reduced
rotational latency or seek time
Index-Based Algorithms
239
Chapter 15
Section 15.6
Clustering and Nonclustering
Indexes
240
 Clustered relation: tuples are packed into roughly
as few blocks as can possibly hold those tuples.
 Clustering index: an index on an attribute such that
all the tuples with a fixed value for the search key
of this index appear on roughly as few blocks as
can hold them.
Clustering and Nonclustering
Indexes
241


A relation that isn’t clustered cannot have a
clustering index
A clustered relation can have nonclustering
indexes
Index-Based Selection
242
 For a selection σC(R), instead of reading all the tuples
of R, suppose C is a condition of the form a = v,
where a is an attribute and v is a value.
 Then we can search the index with the value v and
get pointers to exactly those tuples of R that have
a-value v.
 These tuples constitute the result of σa=v(R), so we
retrieve only them.
 For a clustering index on R.a:
the number of disk I/O’s will be about B(R)/V(R,a)
Index-Based Selection
243
 The actual number may be higher for several
reasons:
1. The index is not kept entirely in main memory,
so some disk I/O’s are needed to support the
index lookup.
2. Even though all the tuples with a = v might fit
in b blocks, they could be spread over b+1 blocks
because they don’t start at the beginning of a block.
3. The tuples may not be packed as tightly as
possible into blocks.
Example
244
 B(R) = 1000, T(R) = 20,000. Number of I/O’s
required:
 1. Clustered, no index: 1000
 2. Not clustered, no index: 20,000
 3. If V(R,a) = 100, index is clustering: 10
 4. If V(R,a) = 10, index is nonclustering: 2,000
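The four cases reduce to simple cost formulas; the function below is a hypothetical helper (our own name and parameters, not from the slides) that reproduces the numbers above:

```python
def selection_cost(B, T, V=None, clustered=False,
                   use_index=False, clustering_index=False):
    """Estimated disk I/O's for a selection a = v on R, in the four
    cases above. B, T, V stand for B(R), T(R), V(R,a)."""
    if use_index:
        # clustering index: the matching tuples fill about B/V blocks;
        # nonclustering index: roughly one I/O per matching tuple, T/V
        return B // V if clustering_index else T // V
    # no index: scan all blocks if clustered, else about one I/O per tuple
    return B if clustered else T
```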
Joining by Using an Index
245
 Natural join R(X, Y) ⋈ S(Y, Z); recall that X, Y, Z are
sets of attributes.
 Number of I/O’s to read R:
Clustered: B(R)
Not clustered: T(R)
 Number of I/O’s to fetch the matching tuples of S,
totaled over all tuples t of R:
Clustering index on Y: T(R)B(S)/V(S,Y)
Nonclustering index: T(R)T(S)/V(S,Y)
Example
246
 R(X,Y): 1000 blocks; S(Y,Z): 500 blocks.
Assume 10 tuples in each block,
so T(R) = 10,000 and T(S) = 5,000.
V(S,Y) = 100.
If R is clustered, and there is a clustering index
on Y for S:
the number of I/O’s for R is 1000;
the number of I/O’s for S
is 10,000 * 500 / 100 = 50,000.
Joins Using a Sorted Index
247
 If the index is a B-tree, or any other structure from
which we can easily extract the tuples of a
relation in sorted order, we get a number of
opportunities to use the index.
 Natural join R(X, Y) ⋈ S(Y, Z) with an index on Y for
either R or S.
 Extreme case: zig-zag join.
 Example: relations R(X,Y) and S(Y,Z) with an index on
Y for both relations:
search keys (Y-values) for R: 1,3,4,4,5,6
search keys (Y-values) for S: 2,2,4,6,7,8
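The core of the zig-zag idea on the two sorted key lists can be sketched as below. This is a simplification (our own, not from the slides) that reports each matching Y-value once; a real zig-zag join must also handle runs of equal keys on both sides and fetch the actual tuples through the indexes.

```python
def zigzag_matches(r_keys, s_keys):
    """Advance whichever side currently has the smaller key, collecting
    the Y-values present in both (where actual tuples would be joined)."""
    out, i, j = [], 0, 0
    while i < len(r_keys) and j < len(s_keys):
        if r_keys[i] < s_keys[j]:
            i += 1                 # no match for this R key: skip ahead
        elif r_keys[i] > s_keys[j]:
            j += 1                 # no match for this S key: skip ahead
        else:
            out.append(r_keys[i])  # common Y-value found
            i += 1
            j += 1
    return out
```

On the example keys, only Y-values 4 and 6 appear in both relations, and the zig-zag skips past all the others without touching their tuples.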
248
CHAPTER 15.7
BUFFER MANAGEMENT
What does a buffer manager do?
249
Assume there are M main-memory buffers needed for
the operators on relations to store needed data.
In practice:
1) buffers are rarely allocated in advance to the operator;
2) the value of M may vary depending on system
conditions.
The central task of making main-memory buffers
available to processes, such as queries, that act
on the database is given to the buffer manager.
The buffer manager allows processes to get the
memory they need, while minimizing delay and
unsatisfiable requests.
250
The role of the buffer manager

[Figure 1: The role of the buffer manager: it responds to requests for
main-memory access to disk blocks, mediating read/write requests
against the pool of buffers.]
15.7.1 Buffer Management Architecture
251
Two broad architectures for a buffer manager:
1) The buffer manager controls main memory directly.
• Typical of relational DBMSs.
2) The buffer manager allocates buffers in virtual memory,
allowing the OS to decide which buffers are actually
in main memory at any time and which are in the
“swap space” on disk that the operating system manages.
• “main-memory” DBMSs
• “object-oriented” DBMSs
Buffer Pool
252
Key to an efficient buffer manager:
The buffer manager should limit the number of buffers in
use so that they fit in the available main memory. When
all buffers are in use and a new one is needed, it has to
select a buffer to empty, returning its contents to disk.
The number of buffers is a parameter set when the DBMS is
initialized.
No matter which architecture of buffering is used, we simply
assume that there is a fixed-size buffer pool, a set of
buffers available to queries and other database actions.
253
Buffer Pool

[Figure: page requests from higher levels are served from the buffer
pool in main memory; on a miss, a disk page is read into a free frame,
with the choice of frame dictated by the replacement policy.]

 Data must be in RAM for the DBMS to operate on it!
 The buffer manager hides the fact that not all data is in RAM.
15.7.2 Buffer Management
Strategies
Buffer-replacement strategies:
When a buffer is needed for a newly requested
block and the buffer pool is full, what block to
throw out the buffer pool?
254
255
Buffer-replacement strategy
-- LRU
Least-Recently Used (LRU):
Throw out the block that has not been read or written for
the longest time. This method requires that the buffer
manager maintain a table indicating the last time the
block in each buffer was accessed.
It also requires that each database access make an
entry in this table, so there is significant effort in
maintaining this information.
• Requires more maintenance, but it is effective.
• Update the time table on every access.
• Least-recently used blocks are usually less likely to
be accessed soon than other blocks.
256
Buffer-replacement strategy
-- FIFO
First-In-First-Out (FIFO):
The buffer that has been occupied the longest by the same
block is emptied and used for the new block.
• Requires less maintenance but it can make more mistakes.
• In this approach, the buffer manager needs to know only
the time at which the block was read from disk; there
is no need to modify the table when the block is
accessed. Only the loading time is kept.
• The oldest block doesn’t mean it is less likely to be
accessed. Example: the root block of a B-tree index
257
Buffer-replacement strategy
– “Clock”
The “Clock” Algorithm (“Second Chance”)
Efficient approximation to LRU
Think of the 8 buffers as arranged in a circle, shown as
Figure 3
Flags 0 and 1:
buffers with a 0 flag are OK to send their contents back
to disk, i.e. OK to be replaced;
buffers with a 1 flag are not OK to be replaced.
258
Buffer-replacement strategy
– “Clock”
[Figure 3: the clock algorithm. The buffers are arranged in a circle,
each flagged 0 or 1. The “hand” starts at some buffer and rotates
clockwise searching for a 0 flag; the first buffer found with a 0 flag
has its contents replaced. When the hand passes a buffer whose flag
is 1, it sets that flag to 0. If the contents of that buffer are not
accessed by the next time the hand reaches it (i.e., the flag is still
0), the buffer will be replaced. That’s the “second chance”.]
Buffer-replacement strategy -Clock
259
A buffer’s flag is set to 1 when:
 a block is read into the buffer
 the contents of the buffer are accessed
A buffer’s flag is set to 0 when:
 the buffer manager needs a buffer for a new block; it
looks for the first 0 it can find, rotating clockwise. If it
passes 1’s, it sets them to 0.
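The flag rules above can be modeled in a few lines; this is a minimal illustrative sketch of the policy (the class and its shape are our own, not from the slides):

```python
class ClockBuffer:
    """Minimal model of the clock ("second chance") replacement policy."""
    def __init__(self, n):
        self.pages = [None] * n   # which block each buffer holds
        self.flags = [0] * n      # 1 = recently used, 0 = replaceable
        self.hand = 0

    def access(self, block):
        if block in self.pages:             # hit: set the flag to 1
            self.flags[self.pages.index(block)] = 1
            return
        while self.flags[self.hand] == 1:   # give 1-flags a second chance
            self.flags[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = block       # replace the first 0-flag buffer
        self.flags[self.hand] = 1           # reading a block sets its flag
        self.hand = (self.hand + 1) % len(self.pages)
```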
System Control Helps the Buffer-Replacement Strategy
260
System Control
The query processor or other components of a DBMS can
give advice to the buffer manager in order to avoid some
of the mistakes that would occur with a strict policy such
as LRU,FIFO or Clock.
For example:
A “pinned” block means it can’t be moved to disk without
first modifying certain other blocks that point to it.
In FIFO, use “pinned” to force root of a B-tree to remain in
memory at all times.
261
15.7.3 The Relationship Between
Physical Operator Selection and
Buffer Management
Problem:
Physical Operator expected certain number of
buffers M for execution.
However, the buffer manager may not be able
to guarantee these M buffers are available.
262
15.7.3 The Relationship Between
Physical Operator Selection and
Buffer Management
Questions:
Can the algorithm adapt to changes in M, the
number of main-memory buffers available?
When fewer than M buffers are available, and some
blocks have to be put on disk instead of in memory,
how does the buffer-replacement strategy impact
performance (i.e. the number of additional I/O’s)?
Example
263
FOR each chunk of M-1 blocks of S DO BEGIN
read these blocks into main-memory buffers;
organize their tuples into a search structure whose
search key is the common attributes of R and S;
FOR each block b of R DO BEGIN
read b into main memory;
FOR each tuple t of b DO BEGIN
find the tuples of S in main memory that
join with t ;
output the join of t with each of these tuples;
END ;
END ;
END ;
Figure 15.8: The nested-loop join algorithm
15.7.3 The Relationship Between Physical
Operator Selection and Buffer Management
264
The number of outer-loop iterations depends on the average
number of buffers available at each iteration.
The outer loop uses M-1 buffers, and 1 is reserved for a block
of R, the relation of the inner loop.
If we pin the M-1 blocks we use for S on one iteration of the
outer loop, we shall not lose their buffers during the round.
Also, more buffers may become available, and then we could
keep more than one block of R in memory.
Will these extra buffers improve the running time?
Example
265
CASE1: NO
Buffer-replacement strategy: LRU
Buffers for R: k
We read each block of R in order into buffers.
By end of the iteration of the outer loop, the last k blocks of R
are in buffers.
However, next iteration will start from the beginning of R
again.
Therefore, the k buffers for R will need to be replaced.
Example
266
CASE 2: YES
Buffer-replacement strategy: LRU
Buffers for R: k
We read the blocks of R in an order that alternates:
first→last on one iteration, and then last→first on the next.
In this way, we save k disk I/Os on each iteration of the outer
loop except the first iteration.
Other Algorithms and M buffers
267
Other algorithms are also impacted by M and the
buffer-replacement strategy.
Sort-based algorithms:
If M shrinks, we can change the size of a
sublist.
Unexpected result: too many sublists to
allocate each sublist a buffer.
Hash-based algorithms:
If M shrinks, we can reduce the number of
buckets, as long as the buckets still fit in M
buffers.
268
15.8 ALGORITHMS USING
MORE THAN TWO PASSES
Presented by: Kai Zhu
Professor: Dr. T.Y. Lin
Class ID: 220
269
Reason that we use more than two passes:
Two passes are usually enough; however, for the
largest relations, we use as many passes as
necessary.
270
MULTI-PASS SORT-BASED
ALGORITHMS
Suppose we have M main-memory buffers
available to sort a relation R, which we
assume is stored clustered.
Then we do the following:
271
BASIS:
If R fits in M blocks (i.e., B(R) ≤ M):
1. Read R into main memory.
2. Sort it using any main-memory
sorting algorithm.
3. Write the sorted relation to disk.
272
INDUCTION:
If R does not fit into main memory:
1. Partition the blocks holding R into M
groups, which we shall call R1, R2, …, RM.
2. Recursively sort Ri for each
i = 1, 2, …, M.
3. Merge the M sorted sublists.
273
If we are not merely sorting R, but performing
a unary operation such as δ or γ on R, we can
modify the above so that at the final merge we
perform the operation on the tuples at the front
of the sorted sublists.
That is:
274


For a δ, output one copy of each distinct tuple,
and skip over copies of the tuple.
For a γ, sort on the grouping attributes only,
and combine the tuples with a given value of
these grouping attributes.
275
Conclusion
The two pass algorithms based on sorting or
hashing have natural recursive analogs that
take three or more passes and will work for
larger amounts of data.
Performance of Multipass, Sort-Based Algorithms
276
 Let s(M,k) be the maximum size of a relation we can
sort with M buffers and k passes.
 BASIS: If k = 1, one pass is allowed, so we must
have B(R) ≤ M; put another way, s(M,1) = M.
 INDUCTION: Suppose k > 1. Then we partition R into M
pieces, each of which must be sortable in k-1 passes.
If B(R) = s(M,k), then s(M,k)/M, which is the size
of each of the M pieces of R, cannot exceed s(M,k-1);
that is, s(M,k) = M·s(M,k-1), so s(M,k) = M^k.
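The basis/induction structure maps directly onto a recursive sketch. This is an illustrative model (blocks as lists, `heapq.merge` standing in for the final M-way merge), not code from the slides; it shows why k passes with M buffers handle up to M^k blocks:

```python
import heapq

def multipass_sort(blocks, m):
    """Recursive multipass sort. BASIS: if R fits in M blocks, sort it
    in memory. INDUCTION: partition into M groups, sort each group
    recursively, then merge the M sorted sublists."""
    if len(blocks) <= m:                 # basis: B(R) <= M
        return sorted(t for b in blocks for t in b)
    size = -(-len(blocks) // m)          # ceil(B(R)/M) blocks per group
    runs = [multipass_sort(blocks[i:i + size], m)
            for i in range(0, len(blocks), size)]
    return list(heapq.merge(*runs))      # merge the sorted sublists
```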
277
The Query Compiler
16.1 Parsing and Preprocessing
Query compilation is divided
into three steps
The query compiler performs the following operations:
1. Parsing: parse the SQL query into a parse tree.
2. Logical query plan: transform the parse tree into
an expression tree of relational algebra.
3. Physical query plan:
278
transform the logical query plan
into a physical query plan, specifying:
. the operations performed
. the order of operations
. the algorithms used
. the way in which stored data is obtained and passed from
one operation to another
Query → Parser → Preprocessor → Logical query plan
generator → Query rewriter → Preferred logical query plan
279
From a query to a logical query
plan
Syntax Analysis and Parse Tree
The parser takes the SQL query and converts it to a
parse tree. Nodes of the parse tree are:
1. Atoms: lexical elements such as
keywords, constants, parentheses, operators,
and other schema elements.
2. Syntactic categories: subparts that play a
similar role in a query, such as <Query> and <Condition>.
280
Grammar for Simple Subset of SQL
<Query> ::= <SFW>
<Query> ::= (<Query>)
281
<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
<SelList> ::= <Attribute>,<SelList>
<SelList> ::= <Attribute>
<FromList> ::= <Relation>, <FromList>
<FromList> ::= <Relation>
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
<Tuple> ::= <Attribute>
Atoms (constants), <syntactic categories> (variables),
::= (“can be defined as”)
Query and Parse Tree
282
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query:
Give titles of movies that have at least one star born in
1960
SELECT title FROM StarsIn WHERE starName IN
(
SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%'
);
283
An equivalent query
284
SELECT title
FROM StarsIn, MovieStar
WHERE starName = name AND
birthdate LIKE '%1960%' ;
Parse Tree
285

<Query>
 └── <SFW>
      ├── SELECT <SelList>
      │     └── <Attribute> → title
      ├── FROM <FromList>
      │     ├── <RelName> → StarsIn
      │     └── , <FromList>
      │           └── <RelName> → MovieStar
      └── WHERE <Condition>
            └── <Condition> AND <Condition>
                  ├── <Attribute> starName = <Attribute> name
                  └── <Attribute> birthdate LIKE <Pattern> '%1960%'
The Preprocessor
286
Functions of the Preprocessor
. If a relation used in the query is a virtual view, then each use of
this relation in the from-list must be replaced by a parse tree that
describes the view.
. It is also responsible for semantic checking:
1. Checks relation uses: every relation mentioned in a FROM-clause
must be a relation or a view in the current schema.
2. Checks and resolves attribute uses: every attribute mentioned
in the SELECT or WHERE clause must be an attribute of some
relation in the current scope.
3. Checks types: all attributes must be of a type appropriate to
their uses.
287
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query:
Give titles of movies that have at least one star born in
1960
SELECT title FROM StarsIn WHERE starName IN
(
SELECT name FROM MovieStar WHERE
birthdate LIKE '%1960%'
);
Preprocessing Queries Involving Views
288
When an operand in a query is a virtual view, the
preprocessor needs to replace the operand by a piece of
parse tree that represents how the view is constructed from
base tables.
Base table: Movies(title, year, length, genre, studioName,
producerC#)
View definition: CREATE VIEW ParamountMovies AS
SELECT title, year FROM Movies
WHERE studioName = 'Paramount';
Example query based on the view:
SELECT title FROM ParamountMovies WHERE year = 1979;
289
16.2 ALGEBRAIC LAWS
FOR IMPROVING QUERY
PLANS
Optimizing the Logical Query
Plan
290
 The translation rules converting a parse tree to a logical
query tree do not always produce the best logical query
tree.
 The initial logical query tree can often be improved by
applying relational algebra laws to convert the original
tree into a more efficient logical query tree.
 Optimizing a logical query tree using relational algebra
laws is called heuristic optimization.
Relational Algebra Laws
291
These laws often involve the properties of:
 commutativity - an operator can be applied to its
operands in either order.
 E.g. A + B = B + A - the “+” operator is commutative.
 associativity - an operator is independent of operand
grouping.
 E.g. A + (B + C) = (A + B) + C - the “+” operator is
associative.
Associative and Commutative
Operators
292
 The relational algebra operators of cross-product
(×), join (⋈), union (), and intersection (∩) are all
associative and commutative.

Commutative        Associative
R × S = S × R      (R × S) × T = R × (S × T)
R ⋈ S = S ⋈ R      (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
R  S = S  R      (R  S)  T = R  (S  T)
R ∩ S = S ∩ R      (R ∩ S) ∩ T = R ∩ (S ∩ T)
Laws Involving Selection
293
 Complex selections involving AND or OR can be broken into
two or more selections (splitting laws):
σC1 AND C2 (R) = σC1( σC2 (R))
σC1 OR C2 (R) = ( σC1 (R) ) S ( σC2 (R) )
Example
 R = {a,a,b,b,b,c}
 p1 satisfied by a,b; p2 satisfied by b,c
 σp1 OR p2 (R) = {a,a,b,b,b,c}
 σp1(R) = {a,a,b,b,b}
 σp2(R) = {b,b,b,c}
 σp1(R) S σp2(R) = {a,a,b,b,b,c}
Laws Involving Selection
(Contd..)
294

Selection is pushed through both arguments
for union:
σC(R  S) = σC(R)  σC(S)

Selection is pushed to the first argument and
optionally the second for difference:
σC(R - S) = σC(R) - S
σC(R - S) = σC(R) - σC(S)
Laws Involving Selection
(Contd..)
295
 All other operators require the selection to be pushed
to only one of the arguments.
 For joins, we may not be able to push the selection to
both arguments, since an argument may not have the
attributes the selection requires.
σC(R × S) = σC(R) × S
σC(R ∩ S) = σC(R) ∩ S
σC(R ⋈ S) = σC(R) ⋈ S
σC(R ⋈D S) = σC(R) ⋈D S
Laws Involving Selection
(Contd..)
296
Example
 Consider relations R(a,b) and S(b,c) and the
expression
  σ(a=1 OR a=3) AND b<c (R ⋈ S)
= σa=1 OR a=3(σb<c (R ⋈ S))
= σa=1 OR a=3(R ⋈ σb<c (S))
= σa=1 OR a=3(R) ⋈ σb<c (S)
Laws Involving Projection
297
 Like selections, it is also possible to push
projections down the logical query tree.
 However, projections just reduce the number of
attributes instead of reducing the number of
tuples, hence the performance gain is less than
for selection.
Laws Involving Projection
298
 Laws for pushing projections through joins:
πL(R × S) = πL(πM(R) × πN(S))
πL(R ⋈ S) = πL(πM(R) ⋈ πN(S))
πL(R ⋈D S) = πL(πM(R) ⋈D πN(S))
where M and N are the attributes of R and S, respectively,
that are either join attributes or attributes of L.
Laws Involving Projection
299
 Laws for pushing projections through set operations:
 Projection can be performed entirely before a bag union:
πL(R B S) = πL(R) B πL(S)
 Projection can be pushed below selection as long
as we keep all attributes needed for the selection
(M = L  attr(C)):
πL( σC (R)) = πL( σC (πM(R)))
Laws Involving Join
300

We already learned about
1. Joins are commutative and associative.
2. Selection can be distributed into joins.
3. Projection can be distributed into joins.
Laws Involving Duplicate
Elimination
301
 The duplicate elimination operator (δ) can be
pushed through many operators, but not all.
 Example: if R has two copies of tuple t, and S has
one copy of t, then
δ(R  S) yields one copy of t, while
δ(R)  δ(S) yields two copies of t.
Laws Involving Duplicate
Elimination
302

Laws for pushing the duplicate-elimination operator (δ):
δ(R × S) = δ(R) × δ(S)
δ(R ⋈ S) = δ(R) ⋈ δ(S)
δ(R ⋈D S) = δ(R) ⋈D δ(S)
δ(σC(R)) = σC(δ(R))
Laws Involving Duplicate
Elimination
303

The duplicate elimination operator (δ) can also
be pushed through bag intersection, but not
across union, difference, or projection in
general.
δ(R ∩ S) = δ(R) ∩ δ(S)
Laws Involving Grouping
304


The laws for the grouping operator (γ) depend on the aggregate operators used.
• There is one general rule, however: grouping subsumes duplicate elimination:
δ(γL(R)) = γL(R)
• Whether δ may also be pushed below γ depends on the aggregates: some (MIN and MAX) are unaffected by duplicates, while others (SUM, COUNT, and AVG) generally are.
305
Section 16.3
Presented By:
Under the supervision of Dr. T.Y. Lin
Topics to be covered
306

From Parse to Logical Query
Plans
 Conversion
to Relational Algebra
 Removing Subqueries From Conditions
 Improving the Logical Query Plan
 Grouping Associative/ Commutative Operators
Parsing
307
• The parser's goal is to convert a text string containing a query into a parse tree data structure:
– leaves form the text string (broken into lexical
elements)
– internal nodes are syntactic categories
• Uses standard algorithmic techniques from
compilers
– given a grammar for the language (e.g., SQL),
process the string and build the tree
Review
308
Query
Parser
Section 16.1
Preprocessor
Logical query plan
generator
Section 16.3
Query Rewriter
Preferred logical query plan
Two steps to turn Parse tree into
Preferred Logical Query Plan
309

• The nodes and structures of the parse tree are replaced, in appropriate groups, by an operator or operators of relational algebra.
• The relational-algebra expression is then rewritten into an expression that can be converted to the most efficient physical query plan.
Reference Relations
310

StarsIn (movieTitle, movieYear, starName)

MovieStar (name, address, gender, birthdate)
SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE ‘%1960’
);
(Find the movies with stars born in 1960)
Assume we have a simplified grammar for SQL.
Conversion to Relational Algebra
311

If we have a <Query> with a <Condition> that has
no subqueries, then we may replace the entire
construct – the select-list, from-list, and condition –
by a relational-algebra expression.
•
The complete algorithm depends on specific
grammar, which determines forms of the parse
trees
312
The relational-algebra expression consists of the following, from bottom to top:
• The product of all the relations mentioned in the <FromList>, which is the argument of:
• A selection σC, where C is the <Condition> expression in the construct being replaced, which in turn is the argument of:
• A projection πL, where L is the list of attributes in the <SelList>.

A query : Example
313

SELECT movieTitle
FROM Starsin, MovieStar
WHERE starName = name AND
birthdate LIKE ‘%1960’;
314
Translation to an algebraic expression tree
315
The Preprocessor
• It replaces each reference to a view with a
parse (sub)-tree that describes the view (i.e., a
query)
• It does semantic checking:
– are relations and views mentioned in the schema?
– are attributes mentioned in the current scope?
– are attribute types correct?
316
Removing Subqueries From Conditions
317
• For parse trees with a <Condition> that has a subquery, we use an intermediate operator: the two-argument selection.
• It is intermediate between the syntactic categories of the parse tree and the relational-algebra operators that apply to relations.
Using a two-argument σ
318
[Tree: πmovieTitle over a two-argument σ whose left child is StarsIn and whose right child is the <Condition>: <Tuple>(<Attribute> starName) IN πname(σbirthdate LIKE '%1960'(MovieStar))]
Converting Two-Argument
Selection
• To continue the conversion, we need rules for
replacing two-argument selection with a
relational algebra expression
• Different rules depending on the nature of the
sub query
• Here is shown an example for IN operator and
uncorrelated query (sub query computes a
relation independent of the tuple being tested)
319
Two argument selection with condition
involving IN
320
We have two arguments: some relation R, and a <Condition> of the form t IN S, where:
• t is a tuple composed of some attributes of R
• S is an uncorrelated subquery
Steps to be followed:
1. Replace the <Condition> by the tree that is the expression for S (δ is used to remove duplicates).
2. Replace the two-argument selection by a one-argument selection σC, where C equates the components of t to the corresponding attributes of S.
3. Give σC an argument that is the product of R and S.
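The three steps can be mimicked on the running StarsIn/MovieStar example. This is a hedged sketch with invented sample tuples; a real optimizer transforms trees, not Python lists.

```python
# Sample data (invented for illustration); schemas follow the slides:
# StarsIn(movieTitle, movieYear, starName), MovieStar(name, address, gender, birthdate)
StarsIn = [("Movie1", 1995, "Alice"), ("Movie2", 1990, "Bob")]
MovieStar = [("Alice", "addr1", "F", "1-1-1960"),
             ("Bob",   "addr2", "M", "2-2-1955")]

# Step 1: the expression for S, with delta (a Python set) removing duplicates
S = {name for (name, addr, gender, bd) in MovieStar if bd.endswith("1960")}

# Steps 2-3: a one-argument selection sigma_C over the product of R and S,
# where C equates starName with the tuple from S
result = [(title,) for (title, year, star) in StarsIn
          for name in S
          if star == name]

assert result == [("Movie1",)]   # the movie whose star was born in 1960
```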
Two argument selection with condition
involving IN
321
[Transformation: the two-argument σ over R and <Condition>(t IN S) becomes σC over the product R × δ(S)]
The effect
322
What if Subquery is Correlated?




323
An example is when the subquery refers to the current tuple of the outer scope that is being tested
More complicated to deal with, since
subquery cannot be translated in isolation
Need to incorporate external attributes in the
translation
Some details are in textbook
Improving the Logical Query Plan
324

Algebraic laws to improve logical query plans:
 Selections
can be pushed down the expression
tree as far as they can go.
 Similarly, projections can be pushed down the
tree, or new projections can be added.
 Duplicate eliminations can sometimes be
removed, or moved to a more convenient position
in the tree.
 Certain selections can be combined with a
product below to turn the pair of operations into
an equijoin.
Grouping Associative/ Commutative
Operators
325




An operator that is associative and commutative may be thought of as having any number of operands.
We need to reorder these operands so that the multiway join is executed as a sequence of binary joins.
It is often more time-consuming to execute them in the order suggested by the parse tree.
For each portion of subtree that consists of nodes
with the same associative and commutative
operator (natural join, union, and intersection), we
group the nodes with these operators into a single
node with many children.
The effect of query rewriting
326
[Tree: πmovieTitle over the join StarsIn ⋈starName=name σbirthdate LIKE '%1960'(MovieStar)]
Final step in producing logical query plan
327
[Cascade of binary unions ((((R ∪ S) ∪ T) ∪ V) ∪ W) => a single union node with children R, S, T, V, W]
An Example to summarize
328

“find movies where the average age of the stars was at
most 40 when the movie was made”
SELECT DISTINCT m1.movieTitle, m1.movieYear
FROM StarsIn m1
WHERE m1.movieYear - 40 <= (
SELECT AVG(birthdate)
FROM StarsIn m2, MovieStar s
WHERE m2.starName = s.name AND
m1.movieTitle = m2.movieTitle AND
m1.movieYear = m2.movieYear
);
329
Selections combined with a product to turn
the pair of operations into an equijoin…
330
Condition pushed up the expression tree…
331
332
Selections combined…
333
334
335
16.4 ESTIMATING THE COST
OF OPERATIONS
Estimating the Cost of Operations
336



After getting to the logical query plan, we turn it into
physical plan.
Consider all the possible physical plan and estimate
their costs – this evaluation is known as cost-based
enumeration.
The one with least estimated cost is the one
selected to be passed to the query-execution
engine.
Selection for each physical plan
337

We select for each physical plan:
• An order and grouping for associative-and-commutative operations like joins, unions, and intersections.
• An algorithm for each operator in the logical plan, for instance, deciding whether a nested-loop join or hash-join should be used.
• Additional operators – scanning, sorting, etc. – that are needed for the physical plan but were not present explicitly in the logical plan.
• The way in which the arguments are passed from one operator to the next.
Estimating Sizes of Intermediate
Relations
338
Rules for estimating the number of tuples in an intermediate relation should:
1. Give accurate estimates.
2. Be easy to compute.
3. Be logically consistent; that is, the size estimate for an intermediate relation should not depend on how that relation is computed.
The objective of estimation is to select the best physical plan with the least cost.
Estimating the Size of a Projection
339
• Projection is different from the other operators in that the size of the result can be computed exactly.
• We treat the classical, duplicate-eliminating projection as a bag projection.
• There may be a reduction in tuple size (due to eliminated components) or an increase (due to new components created as combinations of attributes).
Estimating the Size of a Selection
340
• A selection may reduce the number of tuples, but the size of each tuple remains the same.
• For S = σA=c(R), where A is an attribute of R and c is a constant, the recommended estimate is:
T(S) = T(R) / V(R,A)
• The rule above surely holds if all values of attribute A occur equally often in the database.
Estimating Sizes of Other
Operations
341





Union
Intersection
Difference
Duplicate Elimination
Grouping and Aggregation
Estimating the Size of a Join
342
• Two simplifying assumptions:
1. Containment of Value Sets: if R and S are two relations with attribute Y and V(R,Y) ≤ V(S,Y), then every Y-value of R will be a Y-value of S.
2. Preservation of Value Sets: if we join relation R with relation S, and attribute A is in R but not in S, then all distinct values of A are preserved and none are lost: V(R ⋈ S, A) = V(R,A).
• Under these assumptions, we estimate:
T(R ⋈ S) = T(R)T(S) / max(V(R,Y), V(S,Y))
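The join-size formula is easy to package as a helper. The statistics below are made-up numbers purely to exercise the formula.

```python
def estimate_join_size(T_R, T_S, V_R_Y, V_S_Y):
    # T(R join S) = T(R) * T(S) / max(V(R,Y), V(S,Y)),
    # under containment and preservation of value sets
    return T_R * T_S // max(V_R_Y, V_S_Y)

# e.g., T(R)=1000, T(S)=2000, V(R,Y)=20, V(S,Y)=50 (hypothetical statistics)
assert estimate_join_size(1000, 2000, 20, 50) == 40_000
```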
Natural Joins With Multiple Join
Attributes
343
• Of the T(R)T(S) pairs of tuples from R and S, the expected number of pairs that match in both y1 and y2 is:
T(R)T(S) / (max(V(R,y1), V(S,y1)) × max(V(R,y2), V(S,y2)))
• In general, the following rule estimates the size of a natural join when any number of attributes are shared between the two relations:
• The estimate of the size of R ⋈ S is computed by multiplying T(R) by T(S) and dividing by the larger of V(R,y) and V(S,y) for each attribute y that is common to R and S.
Joins of Many Relations(1)
344
• Rule for estimating the size of any join:
• Start with the product of the number of tuples in each relation.
• Then, for each attribute A appearing at least twice, divide by all but the least of the V(R,A)'s.
• We can also estimate the number of distinct values that will remain for attribute A after the join: by the preservation-of-value-sets assumption, it is the least of these V(R,A)'s.
Joins of Many Relations(2)
345
• Based on the two assumptions – containment and preservation of value sets:
• In a natural join of n relations, the estimation rules, applied to each join individually, yield the same estimate for the size of the result. Moreover, this estimate is the same as the one we get by applying the rule to the join of all n relations as a whole.
Estimating Sizes for Other
Operations
346
• Union: the average of the sum and the larger of the two sizes.
• Intersection:
• Approach 1: take the average of the extremes, which is half the smaller.
• Approach 2: intersection is an extreme case of the natural join, so use the formula
T(R ⋈ S) = T(R)T(S) / max(V(R,Y), V(S,Y))
Estimating Sizes for Other
Operations
347

Difference: T(R)-(1/2)*T(S)

Duplicate Elimination: take the smaller of
(1/2)*T(R) and the product of all the V(R, ai )’s.

Grouping and Aggregation: upper-bound the
number of groups by a product of V(R,A)’s,
here attribute A ranges over only the grouping
attributes of L.
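The rules of thumb from the last two slides can be collected into small helpers; the asserted numbers are just the formulas evaluated on arbitrary inputs.

```python
def est_union(T_R, T_S):
    # average of the sum and the larger
    return (T_R + T_S + max(T_R, T_S)) // 2

def est_intersection(T_R, T_S):
    # approach 1: half the smaller
    return min(T_R, T_S) // 2

def est_difference(T_R, T_S):
    # T(R) - (1/2) * T(S)
    return T_R - T_S // 2

def est_dedup(T_R, value_counts):
    # smaller of T(R)/2 and the product of all V(R, a_i)
    prod = 1
    for v in value_counts:
        prod *= v
    return min(T_R // 2, prod)

assert est_union(100, 300) == 350          # (100 + 300 + 300) / 2
assert est_intersection(100, 300) == 50
assert est_difference(1000, 400) == 800
assert est_dedup(1000, [10, 20]) == 200    # product 200 < T(R)/2 = 500
```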
348
CHOOSING AN ORDER
FOR JOINS
INTRODUCTION
349
Significance of Left and Right Join
Arguments


The argument relations in joins determine the
cost of the join
The left argument of the join is
 Called
the build relation
 Assumed to be smaller
 Stored in main-memory
350
Significance of Left and Right Join
Arguments

The right argument of the join is
 Called
the probe relation
 Read a block at a time
 Its tuples are matched with those of build relation

The join algorithms which distinguish between
the arguments are:
 One-pass
join
 Nested-loop join
 Index join
351
Join Trees




352
Order of arguments is important for joining two
relations
Left argument stored in main-memory, should
be smaller
With two relations only two choices of join tree
With more than two relations, there are n!
ways to order the arguments and therefore n!
join trees, where n is the no. of relations
Join Trees
• The total number of tree shapes T(n) for n relations is given by the recurrence T(n) = Σ T(i)T(n−i), summing over i = 1, …, n−1, with:
T(1) = 1
T(2) = 1
T(3) = 2
T(4) = 5 … etc.
354
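The counts 1, 1, 2, 5, … above follow the recurrence T(n) = Σ T(i)T(n−i) (these are the Catalan numbers, shifted), which a few lines of Python can reproduce:

```python
def tree_shapes(n):
    # T(1) = 1; T(n) = sum over i of T(i) * T(n - i), for i = 1 .. n-1
    T = [0, 1]
    for m in range(2, n + 1):
        T.append(sum(T[i] * T[m - i] for i in range(1, m)))
    return T[n]

assert [tree_shapes(n) for n in (1, 2, 3, 4, 5)] == [1, 1, 2, 5, 14]
```

Multiplying by the n! ways to assign relations to leaves gives the full space of join trees the optimizer would otherwise have to consider.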
Left-Deep Join Trees

355
Consider 4 relations. Different ways to join them are as follows:
356
• In fig (a) all the right children are leaves. This is a left-deep tree.
• In fig (c) all the left children are leaves. This is a right-deep tree.
• Fig (b) is a bushy tree.
• Considering left-deep trees is advantageous for deciding join orders.
Join order
357
• Join order selection for A1 ⋈ A2 ⋈ A3 ⋈ … ⋈ An
• Left-deep join trees
• Dynamic programming:
• Best plan computed for each subset of relations
• Best plan(A1, …, An) = min-cost plan of:
Best plan(A2, …, An) ⋈ A1
Best plan(A1, A3, …, An) ⋈ A2
…
Best plan(A1, …, An−1) ⋈ An
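The recurrence above can be sketched as subset dynamic programming over left-deep orders. Everything here is illustrative: the relation sizes and the cost model (cost = sum of intermediate result sizes, with a toy join-size rule) are made up; a real optimizer would plug in the estimates of Section 16.4.

```python
from itertools import combinations

sizes = {"R": 1000, "S": 200, "T": 50}   # hypothetical tuple counts T(rel)

def join_size(left_size, right):
    # toy estimate: joining with `right` scales the running size by T(right)/100
    return left_size * sizes[right] // 100

# best[subset] = (total cost of intermediates, result size, left-deep join order)
best = {frozenset([r]): (0, sizes[r], [r]) for r in sizes}
for k in range(2, len(sizes) + 1):
    for subset in map(frozenset, combinations(sizes, k)):
        candidates = []
        for last in subset:                       # Best plan(rest) join last
            cost, size, order = best[subset - {last}]
            out = join_size(size, last)
            candidates.append((cost + out, out, order + [last]))
        best[subset] = min(candidates)            # keep only the minimum entry

cost, size, order = best[frozenset(sizes)]
assert order == ["S", "T", "R"] and cost == 1100  # small relations joined first
```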
Dynamic Programming to Select a
Join Order and Grouping
358
• Three choices to pick an order for the join of many relations:
• Consider all of the relations
• Consider a subset
• Use a heuristic to pick one
• Dynamic programming is used either to consider all or a subset:
• Construct a table of costs based on relation sizes
• Remember only the minimum entry, which is all that is required to proceed
A Greedy Algorithm for Selecting a
Join Order
363
• It is expensive to use an exhaustive method like dynamic programming.
• A better approach is to use a join-order heuristic for query optimization.
• A greedy algorithm is an example: make one decision at a time about the order of joins, and never backtrack on a decision once made.
364
COMPLETING THE PHYSICAL-QUERYPLAN AND CHAPTER 16 SUMMARY
(16.7-16.8)
CS257 Spring 2009
Professor Tsau Lin
Student: Suntorn Sae-Eung
Donavon Norwood
VI. Notation for Physical Query Plans
365
•
•


A logical operator σc(R) is often combined, or partially
combined, with the access method for relation R, when R is a
stored relation
Other selections, where the argument is not a stored relation
or an appropriate index is not available, will be replaced by
the corresponding physical operator we have called Filter.
Several types of operators:
1. Operators for leaves
2. (Physical) operators for selection
3. (Physical) sort operators
4. Other relational-algebra operations
In practice, each DBMS uses its own internal notation for physical query plans.
Notation for Physical Query Plans
(cont.)
366
1. Operators for leaves
• A leaf operand is replaced in the LQP tree by one of:
• TableScan(R): read all blocks of R
• SortScan(R, L): read R in order according to list L
• IndexScan(R, C): scan the index on attribute A, using a condition C of the form Aθc
• IndexScan(R, A): scan the index on attribute R.A; this behaves like TableScan but may be more efficient if R is not clustered
Notation for Physical Query Plans
(cont.)
367
2. (Physical) operators for selection
• The logical operator σC(R) is often combined with access methods:
• If σC(R) is replaced by Filter(C), and there is no index on R or on an attribute of condition C, use TableScan or SortScan(R, L) to access R.
• If condition C ≡ Aθc AND D for some condition D, and there is an index on R.A, then we may:
• Use the operator IndexScan(R, Aθc) to access R, and
• Use Filter(D) in place of the selection σC(R).
Notation for Physical Query Plans
(cont.)
368
3. (Physical) sort operators
• Sorting can occur at any point in the physical plan, using the notation SortScan(R, L).
• It is common to use an explicit operator Sort(L) to sort a relation that is not stored.
• Sort can be applied at the top of the physical-query-plan tree if the result needs to be sorted for an ORDER BY clause (τ).
Notation for Physical Query Plans
(cont.)
369
4. Other relational-algebra operations
• Descriptive text definitions and signs elaborate on:
• The operation performed, e.g., join or grouping.
• Necessary parameters, e.g., the condition of a theta-join or the list of elements in a grouping.
• A general strategy for the algorithm, e.g., sort-based, hash-based, or index-based.
• A decision about the number of passes to be used, e.g., one-pass, two-pass, or multipass.
• An anticipated number of buffers the operation will require.
Notation for Physical Query Plans
(cont.)
370

Example of a physical-query-plan

A physical-query-plan in example 16.36 for the case
k > 5000




TableScan
Two-pass hash join
Materialize (double line)
Store operator
Notation for Physical Query Plans
(cont.)
371

Another example

A physical-query-plan in example 16.36 for the case
k < 49





TableScan
(2) Two-pass hash join
Pipelining
Different buffers needs
Store operator
Notation for Physical Query Plans
(cont.)
372

A physical-query-plan in example 16.35


Use Index on condition y = 2 first
Filter with the rest condition later on.
VII. Ordering of Physical Operations
373
The PQP is represented as a tree structure with an implied order of operations.
Still, the order of evaluation of interior nodes may not always be clear:
• Iterators are used in a pipelined manner.
• Since the execution of various nodes overlaps in time, "ordering" may not make sense.
374
Ordering of Physical Operations
(cont.)
• The following three rules summarize the ordering of events implicit in a physical-query-plan tree:
1. Break the tree into subtrees at each edge that represents materialization; execute one subtree at a time.
2. Order the execution of the subtrees bottom-up and left-to-right.
3. Execute all nodes of each subtree simultaneously, as a network of iterators.
• Following this strategy, the query optimizer can generate executable code, perhaps a sequence of function calls, for the query.

375
Summary of Chapter 16
376
COMPILATION OF QUERIES
• Compilation means turning a query into a physical query plan, which can be implemented by the query engine.
• Steps of query compilation:
• Parsing
• Semantic checking
• Selection of the preferred logical query plan
• Generation of the best physical plan
377
THE PARSER
• The first step of SQL query processing.
• Generates a parse tree.
• Nodes in the parse tree correspond to SQL constructs.
• Similar to the compiler of a programming language.
378
VIEW EXPANSION
• A very critical part of query compilation.
• Expands the view references in the query tree to the view's definition.
• Provides opportunities for query optimization.
379
SEMANTIC CHECKING
• Checks the semantics of a SQL query.
• Examines a parse tree.
• Checks:
• Attributes
• Relation names
• Types
• Resolves attribute references.
380
CONVERSION TO A LOGICAL QUERY PLAN
• Converts a semantically checked parse tree to an algebraic expression.
• The conversion is straightforward, but subqueries need special care.
• The two-argument selection approach can be used.
381
ALGEBRAIC TRANSFORMATION
• Transforms a logical query plan into a better plan using algebraic transformations.
• The different laws used for this transformation:
• Commutative and associative laws
• Laws involving selection
• Pushing selection
• Laws involving projection
• Laws about joins and products
• Laws involving duplicate elimination
• Laws involving grouping and aggregation
382
ESTIMATING SIZES OF RELATIONS
• When selecting the best logical plan, estimated running time is taken into consideration.
• Two factors affect the size estimates of relations the most:
• Size of relations (number of tuples)
• Number of distinct values for each attribute of each relation
• Histograms are used by some systems.
383
COST-BASED OPTIMIZING
• The best physical query plan is the least costly plan.
• Factors that decide the cost of a query plan:
• Order and grouping of operations like joins, unions, and intersections.
• Whether nested-loop or hash joins are used.
• Scanning and sorting operations.
• Storing intermediate results.
384
PLAN ENUMERATION STRATEGIES
• Common approaches for searching the space for the best physical plan:
• Dynamic programming: tabulating the best plan for each subexpression.
• Selinger-style optimization: also recording the sort order of results as part of each table entry.
• Greedy approaches: making a series of locally optimal decisions.
• Branch-and-bound: start from a plan found by heuristics, and only consider plans cheaper than the best found so far.
385
LEFT-DEEP JOIN TREES
• Left-deep join trees are binary trees with a single spine down the left edge and with leaves as right children.
• This strategy reduces the number of plans to be considered for the best physical plan.
• Restrict the search to left-deep join trees when picking a grouping and order for the join of several relations.
PHYSICAL PLANS FOR SELECTION
386
• Break a selection into an index-scan of the relation, followed by a filter operation.
• The filter examines the tuples retrieved by the index-scan.
• It passes only those tuples that meet the remaining portions of the selection condition.
PIPELINING VERSUS MATERIALIZING
387
• With pipelining, an operator consumes the result of another operator as it is produced, passing the data through main memory.
• With materialization, intermediate results are written to disk, freeing main memory for other operators.
• Both pipelining and materialization should be considered by the physical-query-plan generator.
388
QUERY COMPILER: 16.7
COMPLETING THE PHYSICAL
QUERY-PLAN
CS257 Spring 2009
Professor Tsau Lin
Student: Suntorn Sae-Eung
ID: 212
Before complete Physical-QueryPlan
389
• Up to this point, a query has been:
• Parsed and preprocessed (16.1)
• Converted to a logical query plan (16.3)
• Its operation costs estimated (16.4)
• Evaluated by cost-based plan selection (16.5)
• Given an order for its joins (16.6)
16.7 Completing the Physical-QueryPlan
390

3 topics related to turning LP into a
complete physical plan
1.
2.
3.
Choosing of physical implementations such as
Selection and Join methods
Decisions regarding to intermediate results
(Materialized or Pipelined)
Notation for physical-query-plan operators
I. Choosing a Selection Method (A)
391
• Algorithms for each selection operator:
1. Can we use an existing index on an attribute? If yes, use an index-scan; otherwise, a table-scan.
2. After retrieving all tuples satisfying the indexed condition in (1), filter them with the remaining selection conditions.
Choosing a Selection Method(A)
(cont.)
392
• To pick algorithms for each selection operator, assume there are no multidimensional indexes on several of the attributes.
• Recall: cost of a query = number of disk I/O's.
• How costs for various plans are estimated for a σC(R) operation:
1. Cost of the table-scan algorithm:
a) B(R) if R is clustered
b) T(R) if R is not clustered
2. Cost of a plan picking an equality term (e.g., a = 10) with index-scan:
a) B(R) / V(R, a) for a clustering index
b) T(R) / V(R, a) for a nonclustering index
3. Cost of a plan picking an inequality term (e.g., b < 20) with index-scan:
a) B(R) / 3 for a clustering index
b) T(R) / 3 for a nonclustering index
Example
393
Selection: σx=1 AND y=2 AND z<5 (R)
• where the parameters of R(x, y, z) are: T(R)=5000, B(R)=200, V(R,x)=100, and V(R,y)=500
• Relation R is clustered
• x and y have nonclustering indexes; only the index on z is clustering.
Example (cont.)
394
Selection options:
1. Table-scan, then filter on x, y, z. Cost is B(R) = 200, since R is clustered.
2. Use the index on x = 1, then filter on y and z. Cost is 50, since T(R) / V(R,x) = 5000/100 = 50 tuples and the index is nonclustering.
3. Use the index on y = 2, then filter on x and z. Cost is 10, since T(R) / V(R,y) = 5000/500 = 10 tuples using a nonclustering index.
4. Index-scan on the clustering index with z < 5, then filter on x and y. Cost is about B(R)/3 ≈ 67.
Example (cont.)
395
Costs:
option 1 = 200
option 2 = 50
option 3 = 10
option 4 = 67
The lowest cost is option 3.
Therefore, the preferred physical plan:
1. retrieves all tuples with y = 2,
2. then filters for the two remaining conditions (x and z).
• We decide among the physical plans with which to implement a given selection by estimating the cost of reading data for each possible option.
• We count only the cost of accessing the data blocks, not the index blocks.
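The four cost options from the example can be recomputed directly from the given statistics. Only the statistics stated on the slides are used; integer division makes option 4 come out as 66 rather than the slides' rounded "about 67".

```python
# Statistics from the example: T(R)=5000, B(R)=200, V(R,x)=100, V(R,y)=500
T_R, B_R, V_x, V_y = 5000, 200, 100, 500

costs = {
    "option 1: table-scan (R clustered)":        B_R,          # 200
    "option 2: index on x=1 (nonclustering)":    T_R // V_x,   # 50
    "option 3: index on y=2 (nonclustering)":    T_R // V_y,   # 10
    "option 4: clustering index on z<5":         B_R // 3,     # 66, ~67
}

best = min(costs, key=costs.get)
assert costs[best] == 10
assert best.startswith("option 3")   # the index on y wins, as in the slides
```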
II. Choosing a Join Method
396
• Determine costs associated with each join algorithm:
1. One-pass join or nested-loop join, when enough buffers can be devoted to the join.
2. Sort-join, preferred when the attributes are pre-sorted or when two or more joins are on the same attribute, such as (R(a,b) ⋈ S(a,c)) ⋈ T(a,d), where sorting R and S on a produces a result of R ⋈ S that is sorted on a and can be used directly in the next join.
Choosing a Join Method (cont.)
397
3. Index-join, for a join with a high chance of using an index created on the join attribute, such as R(a,b) ⋈ S(b,c).
4. Hash-join, the best choice for unsorted, non-indexed relations that need a multipass join.
III. Pipelining Versus Materialization
398
• Materialization (the naïve way to execute a query plan): order the operations appropriately and store the result of each operation on disk until it is needed by another operation.
• Pipelining (the more efficient way): interleave the execution of several operations. The tuples produced by one operation are passed directly to the operation that uses them, without ever storing the intermediate tuples on disk.
• Since pipelining saves disk I/O's, there is an obvious advantage to pipelining, but there is a corresponding disadvantage.
IV. Pipelining Unary Operations
399
• Unary operations (tuple-at-a-time or full-relation), such as selection and projection, are the best candidates for pipelining.
[Diagram: R flows through an input buffer into a unary operation and out through an output buffer, which feeds the input buffer of the next unary operation; M−1 buffers are available]
Pipelining Unary Operations (cont.)
400

Pipelining Unary Operations are implemented
by iterators
V. Pipelining Binary Operations
401
• The results of binary operations (∪, ∩, −, ⋈, ×) can also be pipelined.
• We use one buffer to pass the result to its consumer, one block at a time.
• The number of other buffers needed to compute the result and to consume the result varies, depending on the size of the result and the sizes of the other relations involved in the query.
• The extended example below shows the tradeoffs and opportunities.
Example
402
• Consider a physical query plan for the expression
(R(w,x) ⋈ S(x,y)) ⋈ U(y,z)
• Assumptions:
• R occupies 5,000 blocks; S and U each occupy 10,000 blocks.
• The intermediate result R ⋈ S occupies k blocks for some k.
• Both joins will be implemented as hash-joins, either one-pass or two-pass depending on k.
• There are 101 buffers available.
Example (cont.)
403
• First consider the join R ⋈ S; neither relation fits in the buffers.
• We need a two-pass hash-join: partition R into 100 buckets (the maximum possible), each of 50 blocks.
• The second pass of the hash-join uses 51 buffers, leaving 50 buffers for joining the result of R ⋈ S with U.
Example (cont.)
404
• Case 1: suppose k ≤ 49, i.e., the result of R ⋈ S occupies at most 49 blocks. Steps:
1. Pipeline R ⋈ S into 49 buffers.
2. Organize them for lookup as a hash table.
3. Use the one remaining buffer to read each block of U in turn.
4. Execute the second join as a one-pass join.
Example (cont.)
405
• The total number of I/O's is 55,000:
• 45,000 for the two-pass hash-join of R and S
• 10,000 to read U for the one-pass hash-join of (R ⋈ S) ⋈ U
Example (cont.)
406
• Case 2: suppose 49 < k ≤ 5,000. We can still pipeline, but we need another strategy: the intermediate result joins with U in a 50-bucket, two-pass hash-join. Steps:
1. Before starting on R ⋈ S, hash U into 50 buckets of 200 blocks each.
2. Perform the two-pass hash-join of R and S using 51 buffers as in case 1, placing the results in the 50 remaining buffers to form 50 buckets for the join of R ⋈ S with U.
3. Finally, join R ⋈ S with U bucket by bucket.
Example (cont.)
407
• The number of disk I/O's is:
• 20,000 to read U and write its tuples into buckets
• 45,000 for the two-pass hash-join of R and S
• k to write out the buckets of R ⋈ S
• k + 10,000 to read the buckets of R ⋈ S and U in the final join
• The total cost is 75,000 + 2k.
Example (cont.)
408
• Compare the I/O counts of case 1 and case 2:
• k ≤ 49 (case 1): disk I/O's = 55,000
• 50 ≤ k ≤ 5,000 (case 2):
• k = 50: I/O's = 75,000 + (2×50) = 75,100
• k = 51: I/O's = 75,000 + (2×51) = 75,102
• k = 52: I/O's = 75,000 + (2×52) = 75,104
• Notice: the I/O count jumps discontinuously as k increases from 49 to 50.
Example (cont.)
409
• Case 3: k > 5,000. We cannot perform the two-pass join in the 50 available buffers if the result of R ⋈ S is pipelined. Steps:
1. Compute R ⋈ S using a two-pass join and store the result on disk.
2. Join the result of (1) with U, using a two-pass join.
Example (cont.)
410
• The number of disk I/O's is:
• 45,000 for the two-pass hash-join of R and S
• k to store R ⋈ S on disk
• 30,000 + 3k for the two-pass join of R ⋈ S with U
• The total cost is 75,000 + 4k.
Example (cont.)
411
• In summary, the costs of the physical plan as a function of the size k of R ⋈ S.
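The three cases of the example collapse into one cost function of k (the size, in blocks, of R ⋈ S), matching the totals derived above:

```python
def total_io(k):
    # Total disk I/O's for (R join S) join U, as a function of k
    if k <= 49:            # case 1: pipelined, one-pass second join
        return 55_000
    elif k <= 5_000:       # case 2: pipelined into a 50-bucket two-pass join
        return 75_000 + 2 * k
    else:                  # case 3: materialize R join S, two-pass second join
        return 75_000 + 4 * k

assert total_io(49) == 55_000
assert total_io(50) == 75_100      # the discontinuous jump at k = 50
assert total_io(10_000) == 115_000
```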
412
CHAPTER 18
CONCURRENCY CONTROL
18.1 – 18.2
Chiu Luk
CS257 Database Systems Principles
Spring 2009
CONCURRENCY CONTROL
413



Concurrency control in database management systems
(DBMS) ensures that database transactions are
performed concurrently without the concurrency violating
the data integrity of a database.
Executed transactions should follow the ACID rules. The
DBMS must guarantee that only serializable (unless
Serializability is intentionally relaxed), recoverable
schedules are generated.
It also guarantees that no effect of committed transactions
is lost, and no effect of aborted (rolled back) transactions
remains in the related database.
TRANSACTION ACID RULES
414
Atomicity - To the outside world, the transaction appears to be indivisible, atomic.
Consistency - Every transaction must leave the database in a
consistent state.
Isolation - Transactions cannot interfere with each other. Providing
isolation is the main goal of concurrency control.
Durability - Successful transactions must persist through crashes.
SERIAL AND SERIALIZABLE
SCHEDULES
415
• In the field of databases, a schedule is a list of actions (i.e., reading, writing, aborting, committing) from a set of transactions.
• In this example, Schedule D is over a set of 3 transactions T1, T2, T3. The schedule describes the actions of the transactions as seen by the DBMS.
• T1 reads and writes object X, then T2 reads and writes object Y, and finally T3 reads and writes object Z.
• Because the actions of the 3 transactions are not interleaved, this is said to be a serial schedule.
Serial and Serializable Schedules
416

A schedule that is equivalent to a serial schedule has the serializability property.

In schedule E, the order in which the actions of the transactions are executed is not the
same as in D, but in the end, E gives the same result as D.
Serial Schedule: T1 precedes T2
417
T1                        T2                        A      B
                                                    25     25
Read(A); A ← A+100;
Write(A);                                           125
Read(B); B ← B+100;
Write(B);                                                  125
                          Read(A); A ← A×2;
                          Write(A);                 250
                          Read(B); B ← B×2;
                          Write(B);                        250
Serial Schedule: T2 precedes T1
418
T1                        T2                        A      B
                                                    25     25
                          Read(A); A ← A×2;
                          Write(A);                 50
                          Read(B); B ← B×2;
                          Write(B);                        50
Read(A); A ← A+100;
Write(A);                                           150
Read(B); B ← B+100;
Write(B);                                                  150
A serializable, but not serial, schedule

T1                        T2                        A      B
                                                   25     25
Read(A); A ← A+100;
Write(A);                                          125
                          Read(A); A ← A*2;
                          Write(A);                250
Read(B); B ← B+100;
Write(B);                                                 125
                          Read(B); B ← B*2;
                          Write(B);                       250
                                                   250    250

r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);

A nonserializable schedule

T1                        T2                        A      B
                                                   25     25
Read(A); A ← A+100;
Write(A);                                          125
                          Read(A); A ← A*2;
                          Write(A);                250
                          Read(B); B ← B*2;
                          Write(B);                       50
Read(B); B ← B+100;
Write(B);                                                 150
                                                   250    150

A schedule that is serializable only because of the
detailed behavior of the transactions

T1                        T2'                       A      B
                                                   25     25
Read(A); A ← A+100;
Write(A);                                          125
                          Read(A); A ← A*1;
                          Write(A);                125
                          Read(B); B ← B*1;
                          Write(B);                       25
Read(B); B ← B+100;
Write(B);                                                 125
                                                   125    125

 Regardless of the (consistent) initial state, the final state will be consistent.
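The effect of the schedules above can be checked mechanically. The sketch below (a minimal illustration, not from the slides) replays each schedule as an ordered list of (element, operation) pairs, starting from A = B = 25 as in the examples:

```python
# Replay the example schedules: T1 adds 100 to each element, T2 doubles each.

def run(schedule, db):
    """Apply (element, function) actions in the listed order."""
    for elem, fn in schedule:
        db[elem] = fn(db[elem])
    return db

add100 = lambda v: v + 100   # T1's action on an element
double = lambda v: v * 2     # T2's action on an element

# Serial schedule: all of T1, then all of T2.
serial = [("A", add100), ("B", add100), ("A", double), ("B", double)]
# Interleaved but serializable: T2 touches each element only after T1 does.
serializable = [("A", add100), ("A", double), ("B", add100), ("B", double)]
# Nonserializable: T2 reads/writes B before T1 does.
nonserializable = [("A", add100), ("A", double), ("B", double), ("B", add100)]

print(run(serial, {"A": 25, "B": 25}))          # {'A': 250, 'B': 250}
print(run(serializable, {"A": 25, "B": 25}))    # {'A': 250, 'B': 250}
print(run(nonserializable, {"A": 25, "B": 25})) # {'A': 250, 'B': 150}
```

The nonserializable interleaving leaves A = 250 but B = 150, matching the slide: no serial order of T1 and T2 produces that state.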
Non-Conflicting Actions
Two actions are non-conflicting if, whenever they
occur consecutively in a schedule, swapping them
does not affect the final state produced by the
schedule. Otherwise, they are conflicting.
Conflicting Actions: General Rules
 Two actions of the same transaction conflict, e.g.
r1(A) and w1(B): the order of actions of a single
transaction is fixed and may not be reordered.
 Two actions over the same database element
conflict if at least one of them is a write, e.g.:
 r1(A) and w2(A)
 w1(A) and w2(A)
Conflicting actions
Two or more actions are said to be in conflict if:
 The actions belong to different transactions,
 At least one of the actions is a write operation, and
 The actions access the same object (read or write).
The following set of actions is conflicting:
 T1:R(X), T2:W(X), T3:W(X)
While the following sets of actions are not:
 T1:R(X), T2:R(X), T3:R(X)
 T1:R(X), T2:W(Y), T3:R(X)
Conflict Serializable
 We may take any schedule and make as many
non-conflicting swaps as we wish, with the goal of
turning the schedule into a serial schedule.
 If we can do so, then the original schedule is
serializable, because its effect on the database
state remains the same as we perform each of the
non-conflicting swaps.
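The three-part conflict test above is simple enough to state directly in code. This is a minimal sketch (the tuple format `(transaction, op, element)` is an assumption for illustration); note that the ordering of actions *within* one transaction is also fixed, which this predicate does not need to model:

```python
# Two actions conflict iff they come from different transactions,
# touch the same element, and at least one of them is a write.

def conflicts(a1, a2):
    txn1, op1, elem1 = a1
    txn2, op2, elem2 = a2
    return txn1 != txn2 and elem1 == elem2 and "W" in (op1, op2)

assert conflicts(("T1", "R", "X"), ("T2", "W", "X"))      # read/write, same element
assert not conflicts(("T1", "R", "X"), ("T2", "R", "X"))  # two reads never conflict
assert not conflicts(("T1", "R", "X"), ("T2", "W", "Y"))  # different elements
```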
Conflict Serializable
 A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to
one or more serial schedules.
 Another definition of conflict-serializability is that a schedule is conflict-serializable if
and only if there exists an acyclic precedence graph (serializability graph) for the
schedule.
 The example schedule above is conflict-equivalent to the serial schedule <T1,T2>, but
not to <T2,T1>.
Conflict equivalent / conflict-serializable
 Let Ai and Aj be consecutive non-conflicting actions
that belong to different transactions. We can swap Ai
and Aj without changing the result.
 Two schedules are conflict-equivalent if one can be
turned into the other by a sequence of non-conflicting swaps of adjacent actions.
 We shall call a schedule conflict-serializable if it is
conflict-equivalent to a serial schedule.
Conflict equivalent / conflict-serializable
 Note that conflict-serializability is a sufficient
condition for serializability, i.e. a conflict-serializable schedule is a serializable
schedule.
 Conflict-serializability is not necessary for a
schedule to be serializable.
conflict-serializable (examples)
 (Four slides follow, showing a running example: an interleaved
schedule of T1 and T2, each reading and writing elements A and B,
is transformed one swap of adjacent non-conflicting actions at a
time, until the last slide shows the resulting serial schedule.)
Locks
A scheduler uses a lock table to help perform
its job.
It works as follows:
 It is the responsibility of the scheduler to take each
request from a transaction and either allow it to operate on the
database or block it.
 The scheduler checks the lock table.
 It generates a serializable schedule of actions.
Consistency of transactions
 Actions and locks must relate to each other:
 A transaction can read or write an element only if it
holds a lock on that element and has not yet
released the lock.
 Unlocking every locked element is compulsory.
 Legality of schedules
 No two transactions may lock the same
element unless one of them has first
released its lock.
Locking scheduler
 Grants a lock request only if the resulting
schedule is legal.
 If a request is not granted, the requesting
transaction is delayed; it waits until the
scheduler grants its request at a later time.
 The lock table stores the information about current
locks on the elements.
The locking scheduler (contd.)
 A legal schedule of consistent transactions may,
unfortunately, still not be serializable.
Locking schedule (contd.)
 The locking scheduler delays requests that
would result in an illegal schedule.
Two-phase locking
 Guarantees that a legal schedule of consistent
transactions is conflict-serializable.
 All lock requests precede all unlock requests.
 The growing phase:
 Obtain all the locks; no unlocks allowed.
 The shrinking phase:
 Release all the locks; no new locks allowed.
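The growing/shrinking rule can be sketched as a tiny guard on a transaction object. This is a minimal illustration of the 2PL rule only (lock conflicts between transactions are deliberately ignored; the class name is an assumption):

```python
# Enforce the two-phase rule: once a transaction releases any lock,
# it may not acquire another one.

class TwoPhaseTxn:
    def __init__(self):
        self.shrinking = False   # becomes True at the first unlock
        self.locks = set()

    def lock(self, elem):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock after first unlock")
        self.locks.add(elem)     # growing phase

    def unlock(self, elem):
        self.shrinking = True    # shrinking phase begins
        self.locks.discard(elem)

t = TwoPhaseTxn()
t.lock("A"); t.lock("B")         # growing phase
t.unlock("A")                    # shrinking phase begins
try:
    t.lock("C")                  # illegal under 2PL
except RuntimeError as e:
    print(e)
```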
Working of Two-Phase Locking
 2PL guarantees that a legal schedule of
consistent transactions is conflict-serializable.
 Two stricter protocols for 2PL:
 Strict two-phase locking: a transaction holds all
its exclusive locks till commit/abort.
 Rigorous two-phase locking: a transaction holds
all locks till commit/abort.
 For any transaction Ti that does not follow 2PL, it is
possible to find a transaction Tj that does follow 2PL
and a schedule S of Ti and Tj that
is not conflict-serializable.
Failure of 2PL
 2PL fails to provide protection against deadlocks.
Locking Systems with Several Lock
Modes
 Locking scheme:
 Shared/Read lock (for reading)
 Exclusive/Write lock (for writing)
 Compatibility matrices
 Upgrading locks: transactions are allowed to
take a shared lock and "upgrade" it to an exclusive
(write) lock
 Update locks
 Increment locks: treat specially write actions
that increment a database element
Shared & Exclusive Locks
 We need a stronger lock for writing than for reading.
 Consistency of transactions:
 Cannot write without holding an exclusive lock
 Cannot read without holding some lock
 This basically works on two principles:
 A read action may proceed under a shared or an
exclusive lock
 A write action may proceed only under an exclusive lock
 All locks need to be released before commit.
Shared and exclusive locks (cont.)
 Two-phase locking of transactions:
 All locking must precede all unlocking.
 Legality of schedules:
 An element may be locked exclusively by one
transaction or by several in shared mode, but not both.
Compatibility Matrices
 A compatibility matrix is a convenient way to describe
lock-management policies.
 It has a row and column for each lock mode.
 Rows correspond to a lock held on an element by
another transaction
 Columns correspond to the mode of lock requested.
 Example:

                  LOCK REQUESTED
                    S      X
  LOCK     S       YES     NO
  HELD     X       NO      NO
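The matrix reads naturally as a lookup table: a request is granted only if it is compatible with every lock already held on the element. A minimal sketch (names are illustrative):

```python
# The shared/exclusive compatibility matrix as a lookup table.

COMPAT = {                      # (held, requested) -> granted?
    ("S", "S"): True,  ("S", "X"): False,
    ("X", "S"): False, ("X", "X"): False,
}

def can_grant(held_modes, requested):
    """Grant iff every currently held lock is compatible with the request."""
    return all(COMPAT[(h, requested)] for h in held_modes)

assert can_grant(["S", "S"], "S")   # many readers may share an element
assert not can_grant(["S"], "X")    # a writer must wait for readers
assert not can_grant(["X"], "S")    # a reader must wait for the writer
assert can_grant([], "X")           # no locks held: anything is granted
```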
Upgrading Locks
 Suppose a transaction wants to read as well
as write:
 It acquires a shared lock on the element
 Performs its calculations on the element
 And when it is ready to write, it is granted an
exclusive lock.
 Transactions with unpredictable read/write needs
can use UPGRADING LOCKS.
Upgrading locks (cont.)
 Indiscriminate use of upgrading can produce a
deadlock.
 Example: both transactions want to
upgrade on the same element.
Update locks
 Solve the deadlock that occurs with the upgrading
method.
 An update lock gives a transaction only the
privilege to read X, not to write X.
 An update lock can later be converted to an exclusive
lock.
 An update lock can only be granted on an element that
holds at most shared locks.
Update locks (cont.)
 An update lock is like a shared lock when you are requesting
it and like an exclusive lock once you hold it:
 Once there is an update lock on X, we prevent
additional locks of any kind (shared, update, or exclusive)
from being taken on X.
 Notation: uli(X)
 Compatibility matrix:

                  LOCK REQUESTED
                    S      X      U
  LOCK     S       YES     NO    YES
  HELD     X       NO      NO     NO
           U       NO      NO     NO
Increment Locks
 Used for incrementing and decrementing stored
values.
 E.g. transferring money from one bank account to
another, or ticket-selling transactions in which the
number of seats is decremented after each
transaction.
Increment locks (cont.)
 An increment lock does not enable read or write actions on the
element.
 Any number of transactions can hold an increment lock on an
element at the same time.
 Shared and exclusive locks cannot be granted while an
increment lock is held on the element.

                  LOCK REQUESTED
                    S      X      I
  LOCK     S       YES     NO     NO
  HELD     X       NO      NO     NO
           I       NO      NO    YES
18.4 LOCKING SYSTEMS WITH
SEVERAL LOCK MODES
CS257 Spring/2009
Professor: Tsau Lin
Student: Suntorn Sae-Eung
ID: 212
18.4 LOCKING SYSTEMS WITH
SEVERAL LOCK MODES
 In 18.3, a transaction had to lock a database element X
whether it read or wrote X.
 There is no reason why several transactions could not read X at the
same time, as long as none writes X.
 Introduce locking schemes:
 Shared/Read lock (for reading)
 Exclusive/Write lock (for writing)
 Upgrading locks: transactions are allowed to take a
shared lock and "upgrade" it to an exclusive (write)
lock
 Update locks
 Increment locks: treat specially write actions that
increment a database element
18.4.1 SHARED & EXCLUSIVE LOCKS
 Transaction consistency:
 Cannot write without holding an exclusive lock
 Cannot read without holding some lock
 The lock for writing is considered "stronger" than the lock for reading.
 This basically works on two principles:
 1. A read action may proceed under a shared or an
exclusive lock
 2. A write action may proceed only under an exclusive lock
 All locks need to be released before commit.
18.4.1 SHARED & EXCLUSIVE LOCKS
(CONT.)
 Two-phase locking (2PL) of a transaction Ti:
 Lock → R/W → Unlock
 Notation:
 sli(X) – Ti requests a shared lock on DB element X
 xli(X) – Ti requests an exclusive lock on DB element X
 ui(X) – Ti relinquishes whatever lock it holds on X
18.4.1 SHARED & EXCLUSIVE LOCKS
(CONT.)
 Legality of schedules:
 An element may be locked by one writing transaction or
by several reading transactions in shared mode, but not both.
18.4.2 COMPATIBILITY MATRICES
 A convenient way to describe lock-management policies:
 Rows correspond to a lock held on an element by
another transaction
 Columns correspond to the mode of lock requested.
 Example:

                  Lock requested
                    S      X
  Lock     S       YES     NO
  held     X       NO      NO
18.4.3 UPGRADING LOCKS
 A transaction T taking a shared lock is friendly toward other
transactions.
 When T wants to read and then write a new value of X:
 1. T takes a shared lock on X.
 2. T performs its operations on X (which may take a long time).
 3. When T is ready to write the new value, it "upgrades" the shared lock to
an exclusive lock on X.
 Transactions with unpredictable read/write needs can use
UPGRADING LOCKS.
18.4.3 UPGRADING LOCKS (CONT.)
 Observe the example:
 T1 cannot take an exclusive lock on B until all other locks on B are
released; once B is released, T1 retries and succeeds.
18.4.3 Upgrading Locks (cont.)
 Upgrading can easily cause a "deadlock":
 Both transactions want to upgrade on the
same element.
 Both transactions will wait forever!
18.4.4 Update locks
 The third lock mode, which resolves the deadlock problem. Its
rules are:
 Only an "update lock" can be upgraded to a write (exclusive)
lock later.
 An "update lock" is allowed to be granted on X when there are
already shared locks on X.
 Once there is an "update lock" on X, it prevents any additional
locks from being granted, and it is later changed to a write (exclusive) lock.
 An update lock gives a transaction only the privilege to
read X, not to write X.
 Notation: uli(X)
18.4.4 Update locks (cont.)
 Example (figure on slide)
 Compatibility matrix (asymmetric):

                  Lock requested
                    S      X      U
  Lock     S       YES     NO    YES
  held     X       NO      NO     NO
           U       NO      NO     NO
18.4.5 Increment Locks
 A useful lock for transactions that
increase/decrease a value,
e.g. money transfer between two bank accounts.
 If two transactions (T1, T2) add constants to the
same database element X:
 It doesn't matter which goes first, but no reads are
allowed in between the two increments.
 See the following exhibits.
 Shared and exclusive locks cannot be
granted while an increment lock is held on the
element.
18.4.5 Increment Locks (cont.)

CASE 1:  A=5 → T1: INC(A,2) → A=7  → T2: INC(A,10) → A=17
CASE 2:  A=5 → T2: INC(A,10) → A=15 → T1: INC(A,2)  → A=17

18.4.5 Increment Locks (cont.)
 What if both transactions read A=5 before either writes?
 T1: INC(A,2) computes 7 from A=5; T2: INC(A,10) computes 15 from A=5.
 Whichever writes last determines the final value: A=15 (or A=7),
so A != 17 and one increment is lost.
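The lost-increment interleaving can be replayed deterministically. A minimal sketch (helper names are illustrative), separating the READ from the deferred WRITE so both transactions read the same stale value:

```python
# Replay the "what if" scenario: both increments read A = 5 before either writes.

def read_and_add(db, elem, c):
    t = db[elem]       # READ(A, t)
    return t + c       # t := t + c  (WRITE deferred)

db = {"A": 5}
t1_val = read_and_add(db, "A", 2)    # T1 computes 7 from A = 5
t2_val = read_and_add(db, "A", 10)   # T2 also computes from A = 5, getting 15
db["A"] = t1_val                     # T1 writes 7
db["A"] = t2_val                     # T2 writes 15, overwriting T1's increment
print(db["A"])                       # 15, not the correct 17
```

With increment locks, each INC(A, c) executes atomically and either order yields 17.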
18.4.5 Increment Locks (cont.)
 INC(A, c) –
 An increment action writing database element
A, which is an atomic execution consisting of:
 1. READ(A,t);
 2. t := t+c;
 3. WRITE(A,t);
 Notation:
 ili(X) – action of Ti requesting an increment lock on X
 inci(X) – action of Ti incrementing X by some constant;
we don't care about the value of the constant.
18.4.5 Increment Locks (cont.)
 Example (figure on slide)
 Compatibility matrix:

                  Lock requested
                    S      X      I
  Lock     S       YES     NO     NO
  held     X       NO      NO     NO
           I       NO      NO    YES
Concurrency Control
Section 18.5
Overview
 Assumes knowledge of:
 Locks
 Two-phase locking
 Lock modes: shared, exclusive, update
 A simple scheduler architecture based on the
following principles:
 Insert lock actions into the stream of reads,
writes, and other actions
 Release locks when the transaction manager tells
it that the transaction will commit or abort
Scheduler That Inserts Lock
Actions into the Transactions'
Request Stream
 The scheduler performs the following actions
(and if a transaction is delayed, waiting for a lock,
its actions are delayed with it):
 Part I: takes the stream of requests generated
by the transactions and inserts appropriate lock
actions ahead of the db operations (read, write, or update).
 Part II: takes each action (a lock or db operation) from
Part I and executes it.
 It determines the transaction T that the action belongs
to and the status of T (delayed or not). If T is not
delayed, then:
1. A database-access action is transmitted to the
database and executed.
2. If a lock action is received by Part II, it checks the lock
table to see whether the lock can be granted:
 i) If granted, the lock table is modified to include the granted
lock.
 ii) If not granted, the lock table is updated to record the requested lock,
and Part II delays transaction T.
3. When T commits or aborts, Part I is notified by the
transaction manager and releases all of T's locks.
If any transactions are waiting for those locks, Part I
notifies Part II.
4. When Part II is notified that a lock on some DB
element is available, it determines the next transaction T' that gets
the lock on it.
The Lock Table
 A relation that associates database elements
with locking information about that element
 Implemented with a hash table using
database elements as the hash key
 Its size is proportional to the number of locked
elements only, not to the size of the entire
database

  database element A → lock information for A

Lock Table Entries Structure
Some of the information found in a
lock table entry:
 1) Group mode:
 - S: only shared locks are held
 - X: one exclusive lock and no other locks
 - U: one update lock and possibly one or
more shared locks
 2) Waiting bit: whether at least one transaction is
waiting for a lock on A
 3) A list of the transactions that currently
hold locks on A or are waiting for a lock on A
Handling Lock Requests
 Suppose transaction T requests a lock on A.
 If there is no lock table entry for A, then there
are no locks on A, so create the entry and
grant the lock request.
 If the lock table entry for A exists, use the
group mode to guide the decision about the
lock request.
Handling Lock Requests
 If the group mode is U (update) or X (exclusive):
 No other lock can be granted
 Deny the lock request by T
 Place an entry on the list saying T requests a lock
 And set Wait? = 'yes'
 If the group mode is S (shared):
 Another shared or update lock can be granted
 Grant the request for an S or U lock
 Create an entry for T on the list with Wait? = 'no'
 Change the group mode to U if the new lock is an update
lock
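The request-handling rules above reduce to a small function of the current group mode. A hedged sketch (the return convention, new group mode or `None` for "wait", is an illustrative choice):

```python
# Decide a lock request from the group mode of the element's lock-table entry.

def handle_request(group_mode, requested):
    """Return the new group mode if granted, or None if T must wait."""
    if group_mode is None:         # no entry for A: create it and grant
        return requested
    if group_mode in ("U", "X"):   # update or exclusive held: deny everything
        return None
    if group_mode == "S":          # shared held: another S or U may be granted
        if requested == "S":
            return "S"
        if requested == "U":
            return "U"             # group mode becomes U
    return None                    # S held, X requested: wait

assert handle_request(None, "X") == "X"
assert handle_request("S", "S") == "S"
assert handle_request("S", "U") == "U"
assert handle_request("S", "X") is None
assert handle_request("U", "S") is None
```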
Handling Unlock Requests
 Now suppose transaction T unlocks A:
 Delete T's entry on the list for A.
 If T's lock mode is not the same as the group mode,
there is no need to change the group mode.
 Otherwise, check the entire list to compute the new group
mode:
 S: the new group mode is S, or no lock at all
 U: the new group mode is S, or no lock at all
 X: no lock remains
Handling Unlock Requests
 If the waiting bit is "yes", we need to grant one or more locks, using one
of the following approaches:
First-come-first-served:
 Grant the lock to the longest-waiting request.
 No starvation (no transaction waits forever for a lock).
Priority to shared locks:
 Grant all waiting S locks, then one U lock.
 Grant an X lock if no others are waiting.
Priority to upgrading:
 If there is a U lock waiting to upgrade to an X lock, grant that first.
Managing Hierarchies of Database Elements (18.6)
Concurrency Control
Managing Hierarchies of
Database Elements
 We shall focus on two problems that
arise with locks when there is a tree
structure to the data:
 When the tree structure is a hierarchy of
lockable elements:
 Determine how locks are granted for both large
elements (relations) and smaller elements (blocks
containing tuples, or individual tuples)
 When the data itself is organized as a tree
(e.g. B-tree indexes):
 This will be discussed in the next section
Locks with Multiple Granularity
 A database element can be a relation, a block, or
a tuple.
 Different systems use different sizes of database
element as the unit of locking.
 Some applications call for small database
elements such as tuples or blocks, while others are
better served by large elements such as relations.
Example of Multiple-Granularity
Locks
 Consider a database for a bank:
 Choosing relations as database elements means
we would have one lock for an entire relation.
 If we were dealing with a relation holding account
balances, this kind of lock would be very inflexible
and thus provide very little concurrency.
 Why? Because balance transactions require
exclusive locks, and this would mean only one
transaction could touch any account at a time.
 But as each account is independent of the others, we
could perform transactions on different accounts
simultaneously.
…(contd.)
 Thus it makes sense to lock at the block level, so
that two accounts on different blocks can
be updated simultaneously.
 Another example is that of a document database:
 These documents may be edited from time to
time, but most transactions retrieve whole
documents.
 With similar arguments as above, we see that it is
better to have a large element (a complete
document) as the unit of locking in this case.
 The sensible choice of database element is a
complete document.
Warning (Intention) Locks
 The solution to the problem of managing
locks at different granularities involves a
new kind of lock called a "warning" (intention) lock.
 The rules for managing locks on a hierarchy
of database elements constitute the
warning protocol.
 These are required to manage locks at
different granularities:
 In the bank example, if a shared lock were obtained
on the relation while there were exclusive locks on
individual tuples, unserializable behavior could occur.
Database Elements Organized in
a Hierarchy (figure)
Rules of Warning Protocol
 The warning locks are denoted with the prefix I
(for "intention").
 The protocol involves both ordinary (S and X) and
warning (IS and IX) locks.
 The rules are:
 Begin at the root of the hierarchy.
 Request the S/X lock if we are at the desired element.
 If the desired element is further down the hierarchy,
place a warning lock (IS if S is wanted, IX if X is wanted).
 When the warning lock is granted, proceed to the
appropriate child node and repeat the above steps until the desired
node is reached.
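The descent described by these rules can be sketched as a function that, given the path from the root to the desired element, lists the locks to request in order. A minimal illustration (the `path` representation and function name are assumptions):

```python
# Warning-protocol descent: warnings (IS/IX) on every ancestor,
# then the real S or X lock on the desired element itself.

def locks_needed(path, mode):
    """path: nodes from the root down to the desired element; mode: 'S' or 'X'."""
    warning = "IS" if mode == "S" else "IX"
    return [(node, warning) for node in path[:-1]] + [(path[-1], mode)]

# Locking one tuple t inside block b1 of relation R for writing:
print(locks_needed(["R", "b1", "t"], "X"))
# [('R', 'IX'), ('b1', 'IX'), ('t', 'X')]
```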
Compatibility Matrix for Shared,
Exclusive and Intention Locks

          IS     IX     S      X
  IS     Yes    Yes    Yes    No
  IX     Yes    Yes    No     No
  S      Yes    No     Yes    No
  X      No     No     No     No

 The above matrix applies only to locks held by
other transactions.
Group Modes of Intention Locks
 A transaction can request S and IX locks on the same
element at the same time
(to read the entire element and then modify sub-elements).
 This combination can be considered another lock mode,
SIX, having the restrictions of both locks, i.e.
incompatible with everything except IS.
 SIX then serves as the group mode.
Example
 Consider a transaction T1 as follows:
 Select * from table where attribute1 = 'abc'
 Here, an IS lock is first acquired on the entire relation;
then, moving to the individual tuples (attribute1 = 'abc'), an S
lock is acquired on each of them.
 Consider another transaction T2:
 Update table set attribute2 = 'def' where attribute1 =
'ghi'
 Here, it requires an IX lock on the relation, and since T1's
IS lock is compatible, IX is granted.
 On reaching the desired tuple ('ghi'), as there is no
lock on it, T2 gets X too.
 If T2 were updating the same tuple as T1, it would
have to wait until T1 released its S lock.
Phantoms and Handling
Insertions Correctly
 The phantom problem arises when transactions create new sub-elements
of lockable elements.
 Since we can lock only existing elements, the
new elements fail to be locked.
 The problem created in this situation is
explained in the following example.
Example
 Consider a transaction T3:
 Select sum(length) from table where attribute1 =
'abc'
 This calculates the total length of all tuples having
attribute1 = 'abc'.
 Thus, T3 acquires IS on the relation and S on the
targeted tuples.
 Now, if another transaction T4 inserts a new tuple
having attribute1 = 'abc', the result of T3 becomes
incorrect.
Example (…contd.)
 This by itself is not a concurrency problem, since the
serial order (T3, T4) is maintained.
 But if both T3 and T4 also write an element
X, it can lead to unserializable behavior:
 r3(t1); r3(t2); w4(t3); w4(X); w3(L); w3(X)
 r3 and w3 are read and write operations by T3, w4
are write operations by T4, and L is the total length
calculated by T3 (t1 + t2).
 At the end, we have the result of T3 as the sum of the lengths of
t1 and t2, and X has the value written by T3.
 This is not right: if the value of X is considered to be that
written by T3, then for the schedule to be serializable,
the sum of the lengths of t1, t2 and t3 should have been
computed.
Example (…contd.)
 Conversely, if the sum is the total length of t1 and t2 only, then for the
schedule to be serializable, X should have the value
written by T4.
 This problem arises because the relation has a phantom tuple
(the newly inserted tuple), which should have been locked but
wasn't, since it didn't exist at the time the locks were taken.
 We must regard the insertion or deletion of a tuple as a
write operation on the relation as a whole.
 Thus the inserting transaction must obtain an X lock on the relation
(here, Movie).
 The occurrence of phantoms can be avoided if all insertion
and deletion transactions are treated as write operations on
the whole relation.
496
CONCURRENCY
CONTROL
SECTION 18.7
THE TREE PROTOCOL
BASICS
497
B-Trees
- Tree data structure that keeps data sorted
- allow searches, insertion, and deletion
- commonly used in database and file systems
 Lock
- Enforce limits on access to resources
- way of enforcing concurrency control
 Lock Granularity
- Level and type of information that lock
protects.

TREE PROTOCOL
 Applies to tree structures that are formed
by the link pattern of the
elements themselves: the database
elements are disjoint pieces of data,
and the only way to get to a node is
through its parent.
 A kind of graph-based protocol.
 An alternative to two-phase locking (2PL).
 The nodes of the tree do NOT form a
hierarchy based on containment.
ADVANTAGES OF TREE
PROTOCOL
 Locks can be released earlier than under 2PL,
so they are held for less time.
 Freedom from deadlocks.
18.7.1 MOTIVATION FOR
TREE-BASED LOCKING
 Consider a B-tree index, treating individual nodes
as lockable database elements.
 Concurrent use of the B-tree is not possible with a
standard set of locks and 2PL.
 Therefore, a protocol is needed which can assure
serializability while allowing access to the elements
at the bottom of the tree even though
2PL is violated.
18.7.1 MOTIVATION FOR
TREE-BASED LOCKING (CONT.)
Reason why "concurrent use of a B-tree is not
possible with a standard set of locks and 2PL":
 every transaction must begin by locking
the root node, and
 2PL transactions cannot unlock the root
until all the required locks have been acquired.
18.7.2 ACCESSING TREE-STRUCTURED DATA
Assumptions:
 Only one kind of lock
 Consistent transactions
 Legal schedules
 No 2PL requirement on transactions
18.7.2 RULES FOR ACCESSING
TREE-STRUCTURED DATA
RULES:
 A transaction's first lock may be at any node.
 Subsequent locks may be acquired only if the transaction
currently holds a lock on the parent node.
 Nodes may be unlocked at any time.
 No relocking of a node, even if the node's parent is still locked.
 These are the only restrictions the tree protocol places on locks:
locks are granted only when they do not
conflict with locks already at a node, but there is no two-phase
locking requirement on transactions.
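The four rules above can be sketched as a checker over one transaction's lock/unlock sequence, given the tree's parent map. A minimal illustration (event format and function name are assumptions):

```python
# Check one transaction's lock sequence against the tree-protocol rules.

def follows_tree_protocol(events, parent):
    """events: list of ("lock"/"unlock", node); parent: child -> parent map."""
    locked, ever_locked = set(), set()
    first = True
    for action, node in events:
        if action == "lock":
            if node in ever_locked:
                return False                    # rule: no relocking
            if not first and parent.get(node) not in locked:
                return False                    # rule: parent must be held
            locked.add(node)
            ever_locked.add(node)
            first = False                       # first lock may be anywhere
        else:
            locked.discard(node)                # rule: may unlock any time
    return True

parent = {"B": "A", "C": "A"}
ok  = [("lock", "A"), ("lock", "B"), ("lock", "C"), ("unlock", "A")]
bad = [("lock", "A"), ("lock", "B"), ("unlock", "A"), ("lock", "C")]  # A released
print(follows_tree_protocol(ok, parent), follows_tree_protocol(bad, parent))
```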
18.7.3 WHY THE TREE PROTOCOL
WORKS
 A transaction's first lock may be at any node of the tree.
 Subsequent locks may only be acquired if the transaction currently has a
lock on the parent node.
 Nodes may be unlocked at any time.
 A transaction may not relock a node on which it has released a lock, even
if it still holds a lock on the node's parent.
 The tree protocol implies a serial order on the transactions in the schedule.
 Order of precedence: Ti <s Tj if, in schedule S, Ti and Tj
lock a node in common and Ti locks it first.
 If Ti locks the root before Tj, then Ti locks every node in common
with Tj before Tj does.
ORDER OF PRECEDENCE (figure)
CONCURRENCY
CONTROL
SECTION 18.8
TIMESTAMPS
What is Timestamping?
 The scheduler assigns each transaction T a unique
number, its timestamp TS(T).
 Timestamps must be issued in ascending
order, at the time a transaction first
notifies the scheduler that it is beginning.
Timestamp TS(T)
 Two methods of generating timestamps:
 Use the value of the system clock as the timestamp.
 Use a logical counter that is incremented after each
new timestamp is assigned.
 Whichever method is used, the scheduler maintains a table of currently active
transactions and their timestamps.
Timestamps for a database
element X, and the commit bit
 RT(X): the read time of X, which is the highest
timestamp of any transaction that has read X.
 WT(X): the write time of X, which is the highest
timestamp of any transaction that has written X.
 C(X): the commit bit for X, which is true if and
only if the most recent transaction to write X has
already committed.
Physically Unrealizable Behavior
 Read too late:
A transaction U that started after transaction T
wrote a value for X before T reads X.

  T start → U start → U writes X → T reads X

Physically Unrealizable Behavior
 Write too late:
A transaction U that started after T read X
before T got a chance to write X.

  T start → U start → U reads X → T writes X

Figure: Transaction T tries to write too late.
Dirty Read
 It is possible that after T reads the value of X
written by U, transaction U will abort.

  U start → U writes X → T start → T reads X → U aborts

T performs a dirty read if it reads X at the point shown.
Rules for Timestamp-Based
Scheduling
1. The scheduler receives a request rT(X):
a) If TS(T) ≥ WT(X), the read is physically
realizable.
 1. If C(X) is true, grant the request; if TS(T) > RT(X),
set RT(X) := TS(T), otherwise do not change RT(X).
 2. If C(X) is false, delay T until C(X) becomes true or the
transaction that wrote X aborts.
b) If TS(T) < WT(X), the read is physically
unrealizable. Roll back T.
Rules for Timestamp-Based
Scheduling (Cont.)
2. The scheduler receives a request wT(X):
a) If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), the write is physically
realizable and must be performed:
 1. Write the new value for X,
 2. Set WT(X) := TS(T), and
 3. Set C(X) := false.
b) If TS(T) ≥ RT(X) but TS(T) < WT(X), then the write is
physically realizable, but there is already a later value in X:
 a. If C(X) is true, then the previous writer of X is
committed; simply ignore the write by T.
 b. If C(X) is false, we must delay T.
c) If TS(T) < RT(X), then the write is physically unrealizable,
and T must be rolled back.
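The read and write rules can be sketched as two small functions over the timestamps. This is a simplified illustration: the commit bit C(X) and the "delay" outcome are omitted, so rule 2(b) always ignores the late write (as when the earlier writer has committed):

```python
# Simplified timestamp-scheduling rules; C(X) handling is omitted.
# Each function returns (decision, new RT(X), new WT(X)).

def read(ts, rt, wt):
    if ts >= wt:                     # rule 1a: physically realizable
        return "grant", max(rt, ts), wt
    return "rollback", rt, wt        # rule 1b: T reads a future value

def write(ts, rt, wt):
    if ts >= rt and ts >= wt:        # rule 2a: perform the write
        return "grant", rt, ts
    if ts >= rt:                     # rule 2b: a later value already exists
        return "ignore", rt, wt      # skip T's write (Thomas write rule)
    return "rollback", rt, wt        # rule 2c: physically unrealizable

print(read(ts=5, rt=0, wt=3))        # ('grant', 5, 3)
print(read(ts=2, rt=0, wt=3))        # ('rollback', 0, 3)
print(write(ts=4, rt=2, wt=6))       # ('ignore', 2, 6)
print(write(ts=1, rt=2, wt=0))       # ('rollback', 2, 0)
```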
Rules for Timestamp-Based
Scheduling (Cont.)
In general, the scheduler's three possible responses to a request are:
1. Granting the request,
2. Aborting T (if T would violate physical reality) and restarting T
with a new timestamp (rollback), or
3. Delaying T and later deciding whether to abort T or to grant
the request.
3. The scheduler receives a request to commit T. It must find
all the database elements X written by T and set C(X) :=
true. If any transactions are waiting for X to be
committed, these transactions are allowed to proceed.
4. The scheduler receives a request to abort T, or decides to
roll back T. Then any transaction that was waiting on an
element X that T wrote must repeat its attempt to read or
write.
Multiversion Timestamps
 Keep old versions of each data item to increase
concurrency.
 Each successful write results in the creation of a
new version of the data item written.
 Use timestamps to label the versions.
 When a read(X) operation is issued, select
an appropriate version of X based on the
timestamp of the reading transaction, and return the
value of the selected version.
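The version-selection step can be sketched in a few lines: a reader with timestamp TS(T) sees the version with the largest write timestamp not exceeding TS(T). A minimal illustration (the dict representation is an assumption):

```python
# Multiversion read: pick the version with the largest write timestamp <= TS(T).

def read_version(versions, ts):
    """versions: dict mapping write-timestamp -> value."""
    eligible = [wt for wt in versions if wt <= ts]
    return versions[max(eligible)]

versions = {0: "v0", 5: "v5", 9: "v9"}       # three versions of element X
print(read_version(versions, ts=7))          # v5: the TS-9 version is in its future
print(read_version(versions, ts=9))          # v9
```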
Timestamps and Locking
 Generally, timestamping performs better than
locking in situations where:
 Most transactions are read-only.
 It is rare that concurrent transactions try to
read and write the same element.
 In high-conflict situations, locking performs better
than timestamps:
 In high-conflict situations, rollbacks will be
frequent, introducing more delays than a
locking system would.
Validation-based scheduling
 The scheduler keeps a record of what the active
transactions are doing.
 Each transaction executes in three phases:
1. Read – read the elements in its read set RS(T) and compute results locally
2. Validate – compare its read and write sets with those of other transactions
3. Write – write the elements in its write set WS(T)
Validation-based Scheduler
 Checks are made to ensure serializability is not violated.
 Scheduling of transactions is done by assigning a transaction
number to each transaction.
 There must exist a serial schedule in which transaction Ti
comes before transaction Tj whenever t(i) < t(j).
 If validation fails, the transaction is rolled back;
otherwise it proceeds to the third phase.
 The scheduler thus maintains an assumed serial order of transactions.
 It maintains three sets:
 START( ): set of T's that have started but not completed validation.
 VAL( ): set of T's that have validated but not finished the writing phase.
 FIN( ): set of T's that have finished.
Expected exceptions
1. Suppose there is a transaction U such that:
 U is in VAL or FIN; that is, U has validated,
 FIN(U) > START(T); that is, U did not finish before T started, and
 RS(T) ∩ WS(U) ≠ φ; let it contain database element X.
2. Suppose there is a transaction U such that:
 U is in VAL; that is, U has successfully validated,
 FIN(U) > VAL(T); that is, U did not finish before T entered its validation phase, and
 WS(T) ∩ WS(U) ≠ φ; let X be in both write sets.
Validation rules
 Check that RS(T) ∩ WS(U) = φ for any
previously validated U that did not finish before
T started, i.e. FIN(U) > START(T).
 Check that WS(T) ∩ WS(U) = φ for any
previously validated U that did not finish before
T validated, i.e. FIN(U) > VAL(T).
 In the example below, T4 starts before T1 and T3 finish, so T4
has to be checked against the sets of T1 and
T3.
Example
Solution
 Validation of U:
Nothing to check
 Validation of T:
WS(U) ∩ RS(T) = {D} ∩ {A,B} = φ
WS(U) ∩ WS(T) = {D} ∩ {A,C} = φ
 Validation of V:
RS(V) ∩ WS(T) = {B} ∩ {A,C} = φ
WS(V) ∩ WS(T) = {D,E} ∩ {A,C} = φ
RS(V) ∩ WS(U) = {B} ∩ {D} = φ
 Validation of W:
RS(W) ∩ WS(T) = {A,D} ∩ {A,C} = {A}
RS(W) ∩ WS(V) = {A,D} ∩ {D,E} = {D}
WS(W) ∩ WS(V) = {A,C} ∩ {D,E} = φ
(W is not validated)
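The intersection checks above can be re-run mechanically from the example's read and write sets. A small sketch (the FIN/VAL orderings, i.e. which earlier transactions overlap each validation, are taken as given in the example):

```python
# Read and write sets from the example.
RS = {"U": set(),      "T": {"A", "B"}, "V": {"B"},      "W": {"A", "D"}}
WS = {"U": {"D"},      "T": {"A", "C"}, "V": {"D", "E"}, "W": {"A", "C"}}

def validate(t, overlap_read, overlap_write):
    """overlap_read: earlier T's not finished before t started;
    overlap_write: earlier T's not finished before t validated."""
    for u in overlap_read:
        if RS[t] & WS[u]:
            return False        # t may have read something u later wrote
    for u in overlap_write:
        if WS[t] & WS[u]:
            return False        # t might write before u, out of order
    return True

print(validate("T", ["U"], ["U"]))        # True
print(validate("V", ["T", "U"], ["T"]))   # True
print(validate("W", ["T", "V"], ["V"]))   # False: RS(W) & WS(T) = {'A'}
```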
Comparison
 Locks:
 Storage: space in the lock table is proportional to the number of
database elements locked.
 Delays: delays transactions but avoids rollbacks.
 Timestamps:
 Storage: space is needed for read and write times for every
database element, whether or not it is currently accessed.
 Delays: does not delay transactions, but causes
them to roll back unless interference is low.
 Validation:
 Storage: space is used for timestamps and read/write sets for
each currently active transaction, plus a few more
transactions that finished after some currently active
transaction began.
 Delays: does not delay transactions, but causes
them to roll back unless interference is low.
525
CHAPTER 21
21.1 INTRODUCTION TO
INFORMATION INTEGRATION
CS257 Fan Yang
Need for Information Integration
526
 Databases are created independently, even if they later need to work together.
 Ideally, all the data in the world could be put into a single database, but in the real world this is impossible for a single database:
 databases are created independently;
 it is hard to design a database to support future use.
University Database
527
 Registrar database: data about courses and student grades, for generating transcripts.
 Bursar database: data about tuition payments by students.
 Human Resources database: records of employees, including students with teaching-assistantship jobs.
 Other departments...
Inconvenient
528
 These independent databases are of limited use, because a change in one database is not reflected in the others and must be propagated manually. For example, we want to make sure that the Registrar does not record grades for a student who did not pay fees at the Bursar's office.
 Building a whole new database for the system is a very expensive and time-consuming process.
 In addition to paying for very expensive software, the university would have to run both the old and the new databases together for a long time to verify that the new system works properly.
 Example requirements: record grades only for students who pay tuition; let students swim in the SJSU aquatic center for free during summer vacation. None of these can be achieved with the separate databases.
 Solution: one integrated database.
How to integrate
529
 Start over: build one database that contains all the legacy data, and rewrite all the applications on top of it. Result: painful.
 Build a layer of abstraction (middleware) on top of all the legacy databases, without disturbing the original databases. This layer is often defined by a collection of classes. BUT...
Heterogeneity Problem
530
What is Heterogeneity Problem
The Aardvark Automobile Co. has 1,000 dealers, each with its own database. To find a model at another dealer, can we simply use this command?
SELECT * FROM Cars
WHERE model = 'A6';
(Not necessarily: the dealers' databases may differ in many ways.)

Types of Heterogeneity
531
 Communication heterogeneity
 Query-language heterogeneity
 Schema heterogeneity
 Data-type differences
 Value heterogeneity
 Semantic heterogeneity
532
CHAPTER 21.2
MODES OF INFORMATION
INTEGRATION
ID: 219
Name: Qun Yu
Class: CS257 219 Spring 2009
Instructor: Dr. T.Y.Lin
Content Index
533
21.2 Modes of Information Integration
Information integration allows databases or other distributed information sources to work together.
21.2.1 Federated Database Systems
21.2.2 Data Warehouses
21.2.3 Mediators
Federations
534
 The simplest architecture for integrating several DBs.
 One-to-one connections between all pairs of DBs.
 If n DBs talk to each other, n(n-1) wrappers are needed.
 Good when communication between DBs is limited.
Wrapper
535
 Wrapper: software that translates incoming queries and outgoing answers. As a result, it allows information sources to conform to some shared schema.
Federations Diagram
536
[Diagram: DB1, DB2, DB3, and DB4, with a pair of wrappers on each of the six connections between them.]
A federated collection of 4 DBs needs 12 components to translate queries from one to another.
Example
537
Car dealers want to share their inventory; each dealer queries the other's DB to find a needed car.
Dealer 1's DB relation: NeededCars(model, color, autoTrans)
Dealer 2's DB relations: Autos(serial, model, color) and Options(serial, option)
[Diagram: each dealer's DB sits behind its own wrapper.]
Example…
538
for (each tuple (:m, :c, :a) in NeededCars) {
  if (:a = TRUE) { /* automatic transmission wanted */
    SELECT serial
    FROM Autos, Options
    WHERE Autos.serial = Options.serial AND Options.option = 'autoTrans'
      AND Autos.model = :m AND Autos.color = :c;
  }
  else { /* automatic transmission not wanted */
    SELECT serial
    FROM Autos
    WHERE Autos.model = :m AND Autos.color = :c AND
      NOT EXISTS (SELECT * FROM Options
                  WHERE serial = Autos.serial AND option = 'autoTrans');
  }
}
Dealer 1 queries Dealer 2 for needed cars
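The loop above can be sketched in Python. The function below is illustrative (not from the text); it just builds the SQL string that Dealer 1's wrapper would send to Dealer 2 for one NeededCars tuple:

```python
# Hypothetical sketch of Dealer 1's wrapper building Dealer 2's query.

def dealer2_query(model, color, auto_trans):
    """Return the SQL that Dealer 1 sends to Dealer 2's database."""
    if auto_trans:  # automatic transmission wanted
        return (
            "SELECT serial FROM Autos, Options "
            "WHERE Autos.serial = Options.serial "
            "AND Options.option = 'autoTrans' "
            f"AND Autos.model = '{model}' AND Autos.color = '{color}';"
        )
    # automatic transmission not wanted
    return (
        "SELECT serial FROM Autos "
        f"WHERE Autos.model = '{model}' AND Autos.color = '{color}' "
        "AND NOT EXISTS (SELECT * FROM Options "
        "WHERE serial = Autos.serial AND option = 'autoTrans');"
    )
```

For example, dealer2_query("Gobi", "blue", True) produces the join with Options, and the False case produces the NOT EXISTS form.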
Data Warehouse
539
 Data from the sources is translated from each local schema to a global schema and copied to a central DB.
 Transparent to the user: the user queries the data warehouse just like an ordinary DB.
 The user is not allowed to update the data warehouse directly.
Warehouse Diagram
540
[Diagram: the user sends a query to the Warehouse and receives a result; a Combiner loads the Warehouse from Extractors over Source 1 and Source 2.]
Example
541
Construct a data warehouse from the source DBs of two car dealers.
Dealer 1's schema: Cars(serialNo, model, color, autoTrans, cdPlayer, ...)
Dealer 2's schema: Autos(serial, model, color); Options(serial, option)
Warehouse's schema: AutosWhse(serialNo, model, color, autoTrans, dealer)
Extractor --- query to extract data from Dealer 1's data:
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars;
Example
542
Extractor --- queries to extract data from Dealer 2's data:
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serial, model, color, 'yes', 'dealer2'
FROM Autos, Options
WHERE Autos.serial = Options.serial AND
  option = 'autoTrans';
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serial, model, color, 'no', 'dealer2'
FROM Autos
WHERE NOT EXISTS (SELECT * FROM Options
  WHERE Options.serial = Autos.serial AND option = 'autoTrans');
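As a runnable illustration of the two extractor queries, SQLite can stand in for Dealer 2's DBMS (the toy data below is invented, not from the text):

```python
import sqlite3

# Illustrative: run the Dealer 2 extractor queries against in-memory
# copies of the tables. Schemas are from the text; the rows are made up.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Autos (serial INTEGER, model TEXT, color TEXT);
CREATE TABLE Options (serial INTEGER, option TEXT);
CREATE TABLE AutosWhse (serialNo INTEGER, model TEXT, color TEXT,
                        autoTrans TEXT, dealer TEXT);
INSERT INTO Autos VALUES (1, 'Gobi', 'red'), (2, 'Gobi', 'blue');
INSERT INTO Options VALUES (1, 'autoTrans'), (2, 'cdPlayer');
""")
# Cars with automatic transmission:
db.execute("""
INSERT INTO AutosWhse
SELECT Autos.serial, model, color, 'yes', 'dealer2'
FROM Autos, Options
WHERE Autos.serial = Options.serial AND option = 'autoTrans';
""")
# Cars without automatic transmission:
db.execute("""
INSERT INTO AutosWhse
SELECT serial, model, color, 'no', 'dealer2'
FROM Autos
WHERE NOT EXISTS (SELECT * FROM Options
    WHERE Options.serial = Autos.serial AND option = 'autoTrans');
""")
rows = sorted(db.execute("SELECT serialNo, autoTrans FROM AutosWhse"))
```

With this toy data, `rows` contains car 1 marked 'yes' and car 2 marked 'no'.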
Construct Data Warehouse
543
There are mainly three ways of constructing the data in the warehouse:
1) Periodically reconstruct the warehouse from the current data in the sources, once a night or at even longer intervals.
Advantages: simple algorithms.
Disadvantages:
1) the warehouse must be shut down during reconstruction;
2) data can become out of date.
Construct Data Warehouse
544
2) Update the warehouse periodically (e.g., each night) based on the changes made to the sources.
Advantages: involves smaller amounts of data (important when the warehouse is large and must be modified in a short period).
Disadvantages:
1) the process of calculating the changes to the warehouse is complex;
2) data can still become out of date between updates.
Construct Data Warehouse
545
3) Change the warehouse immediately, in response to each change (or a small set of changes) at one or more of the sources.
Advantages: data does not become out of date.
Disadvantages: requires a great deal of communication, and is therefore generally too expensive (practical only for warehouses whose underlying sources change slowly).
Mediators
546
 A mediator is a virtual warehouse: it supports a virtual view, or a collection of views, that integrates several sources.
 A mediator does not store any data.
 A mediator's tasks:
1) receive the user's query;
2) send queries to the wrappers;
3) combine the results from the wrappers;
4) send the final result to the user.
A Mediator diagram
547
[Diagram: the user's query goes to the Mediator; the Mediator sends queries to a Wrapper over Source 1 and a Wrapper over Source 2, and returns the combined result to the user.]
Example
548
With the same data sources as in the data warehouse example, the mediator integrates the two dealers' sources into a view with the schema:
AutoMed(serialNo, model, color, autoTrans, dealer)
Suppose the user poses the query:
SELECT serialNo, model
FROM AutoMed
WHERE color = 'red';
Example
549
In this simple case, the mediator forwards essentially the same query to each of the two wrappers.
Wrapper 1: Cars(serialNo, model, color, autoTrans, cdPlayer, ...)
SELECT serialNo, model
FROM Cars
WHERE color = 'red';
Wrapper 2: Autos(serial, model, color); Options(serial, option)
SELECT serial, model
FROM Autos
WHERE color = 'red';
The mediator renames serial to serialNo, and then returns the union of these two sets of data to the user.
Example
550
The mediator may have different options for forwarding the user's query. For example, suppose the user asks whether there is a car of a specific model and color (e.g., "Gobi", "blue"). The mediator can decide whether a second query is needed based on the result of the first: if dealer 1 has the car, the mediator does not have to query dealer 2.
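A minimal sketch of this sequential strategy (names invented here); each dealer is abstracted as a callable that runs that wrapper's query:

```python
# Illustrative sketch: query dealer 1 first, and only query dealer 2
# if dealer 1 has no matching car.

def find_car(model, color, dealer1, dealer2):
    """Each dealer argument is a callable (model, color) -> list of rows."""
    hits = dealer1(model, color)
    if hits:           # dealer 1 has the car; skip the second query
        return hits
    return dealer2(model, color)
```

A mediator built this way trades latency (two round trips in the worst case) for avoided work at the second source.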
551
21.3 WRAPPERS IN MEDIATOR-BASED SYSTEMS
Presented by: Kai Zhu
Professor: Dr. T.Y. Lin
552
Wrappers in Mediator-based Systems
 Wrappers in mediator-based systems are more complicated than those in most data warehouse systems.
 A wrapper must be able to accept a variety of queries from the mediator and translate them into the terms of the source.
 It must also communicate the result back to the mediator.
553
How do we design a wrapper? Classify the possible queries that the mediator can ask into templates: queries with parameters that represent constants.
The wrapper (extractor) consists of:
 one or more predefined queries, based on the source (e.g., SQL queries, or queries against a Web page);
 a suitable communication mechanism for sending and receiving information to/from the source and the mediator.
554
Templates for Query Patterns:
 We use the notation T => S to express the idea that the template T is turned by the wrapper into the source query S.
555
 Example 1. Dealer 1 has the relation
Cars(serialNo, model, color, autoTrans, navi, ...)
for use by a mediator with schema
AutosMed(serialNo, model, color, autoTrans, dealer)
 If we denote the code representing a color by the parameter $c, then the template will be:
556
SELECT *
FROM AutosMed
WHERE color = ’$c’;
=>
SELECT serialNo, model, color, autoTrans, ’dealer1’
FROM Cars
WHERE color=’$c’;
(Template T => Source query S)
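Parameter substitution of $c can be sketched with Python's string.Template (an illustration of the idea, not the book's mechanism):

```python
from string import Template

# Illustrative: the wrapper stores the source-query side of the template
# and fills in $c with the constant from the mediator's query.
TEMPLATE = Template(
    "SELECT serialNo, model, color, autoTrans, 'dealer1' "
    "FROM Cars WHERE color = '$c';"
)

def source_query(color):
    """Instantiate the template for one color value."""
    return TEMPLATE.substitute(c=color)
```

For example, source_query("blue") yields the source query with color = 'blue'.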
557
 There will be a total of 2^n templates if we have the option of specifying any of n attributes.
558
Wrapper Generators
 The wrapper generator creates a table that holds the various query patterns contained in the templates, together with the source query associated with each of them.
 A driver is used in each wrapper; the task of the driver is to:
 accept a query from the mediator;
 search the table for a template that matches the query;
 send the corresponding source query to the source;
 return the response to the mediator.
559
 Accept a query from the mediator.
 Search the table for a template that matches the query.
 Send the source query to the source, using a "plug-in" communication mechanism.
 Process the response in the wrapper and return it to the mediator.
560
Filter
 A wrapper can support more queries than its templates directly allow by filtering the results of a template query locally.
561
Example 2. Suppose the wrapper were designed with a more complicated template, whose queries specify both model and color:
SELECT *
FROM AutosMed
WHERE model = ’$m’ AND color = ’$c’;
=>
SELECT serialNo, model, color, autoTrans, ’dealer1’
FROM Cars
WHERE model = ’$m’ AND color=’$c’;
562
 Now suppose the only template we have is the one for color, but the wrapper is asked by the mediator to find a blue Gobi.
563
Solution:
1. Use the color template with $c = 'blue' to find all blue cars, and store them in a temporary relation:
TempAutos(serialNo, model, color, autoTrans, dealer)
2. The wrapper then returns to the mediator the desired set of automobiles by executing the local query:
SELECT *
FROM TempAutos
WHERE model = 'Gobi';
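The two-step filter can be sketched as follows (names are illustrative; the source is abstracted as a callable that runs the color template):

```python
# Illustrative two-step filter: fetch by color via the only available
# template, then filter on model locally (the text's TempAutos relation).

def find_model_and_color(source_query_by_color, model, color):
    temp_autos = source_query_by_color(color)   # template with $c = color
    return [row for row in temp_autos if row["model"] == model]
```

The filtering happens inside the wrapper, so the mediator sees the same answer it would get from a richer template.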
564
Information integration
SECTIONS 21.4 – 21.5
21.4 Capability Based Optimization
565
 Introduction
 A typical DBMS estimates the cost of each query plan and picks the one it believes to be best.
 A mediator, however, usually has little knowledge of how long its sources will take to answer its queries.
 Optimization of mediator queries therefore cannot rely on cost measures alone to select a query plan.
 Instead, the mediator performs capability-based optimization.
21.4.1 The Problem of Limited
Source Capabilities
566
 Many sources have only Web-based interfaces.
 Web sources usually allow querying only through a query form, which does not accept arbitrary SQL queries. Rather, we are invited to enter values for certain attributes, and we receive a response giving values for other attributes.
 E.g., Amazon.com's interface allows us to query about books in many different ways: we can specify an author and get all their books, or we can specify a book title and get information about that book.
 But we cannot ask questions that are too general, e.g., SELECT * FROM Books;
21.4.1 The Problem of Limited
Source Capabilities (con’t)
567
 Reasons why a source may limit the ways in which queries can be asked:
 The earliest databases did not use a relational DBMS, and certainly not a relational DBMS that supports SQL queries.
 Indexes on a large database may make certain queries feasible while others are too expensive to execute.
 Security: e.g., a medical database may answer queries about averages, but will not disclose the details of a particular patient's information.
21.4.2 A Notation for Describing
Source Capabilities
568
 For relational data, the legal forms of queries are described by adornments.
 Adornments are sequences of codes that represent the requirements on the attributes of the relation, in their standard order. The codes we shall use for adornments reflect the most common capabilities of sources. They are:
 f (free): the attribute may be specified or not, as we choose;
 b (bound): we must specify a value for the attribute, but any value is allowed;
 u (unspecified): we are not permitted to specify a value for the attribute;
21.4.2 A notation for Describing
Source Capabilities….(cont’d)
569
 c[S] (choice from set S): a value must be specified, and it must be one of the values in the finite set S;
 o[S] (optional from set S): we either do not specify a value, or we specify a value from the finite set S;
 a prime (e.g., f') indicates that the attribute is not part of the output of the query.
 A capabilities specification is a set of adornments.
 A query must match one of the adornments in the source's capabilities specification.
21.4.2 A notation for Describing
Source Capabilities….(cont’d)
570
 E.g., Dealer 1 is a source of data in the form:
Cars(serialNo, model, color, autoTrans, navi)
The adornment for this query form is b'uuuu.
That is, the first attribute, serialNo, must be specified and is not part of the output; the other attributes must not be specified and are part of the output.
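A small sketch (invented here, not from the text) of checking whether a query matches an adornment such as b'uuuu:

```python
# Hypothetical helper: check a query against one adornment string.
# 'b' = value required, 'u' = value forbidden, 'f' = either; primes
# (output restrictions) are stripped and ignored in this sketch.

def matches(adornment, bound):
    """bound[i] is True if the query supplies a value for attribute i."""
    codes = adornment.replace("'", "")
    if len(codes) != len(bound):
        return False
    for code, is_bound in zip(codes, bound):
        if code == "b" and not is_bound:
            return False
        if code == "u" and is_bound:
            return False
        # 'f' allows either choice
    return True
```

For Dealer 1's form, matches("b'uuuu", [True, False, False, False, False]) holds: serialNo must be specified and nothing else may be.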
21.4.3 Capability-Based Query-Plan
Selection
571
 Given a query at the mediator, a capability-based query optimizer first considers what queries it can ask at the sources to help answer the query.
 The process is repeated until either:
 enough queries have been asked at the sources to resolve all the conditions of the mediator query, and therefore the query is answered; such a plan is called feasible; or
 no more valid forms of source queries can be constructed, yet the mediator query still cannot be answered; the query is impossible.
21.4.3 Capability-Based Query-Plan
Selection (cont’d)
572
 The simplest form of mediator query for which we need this strategy is a join of relations.
 E.g., we have these sources for dealer 2:
Autos(serial, model, color)
Options(serial, option)
 Suppose that ubf is the sole adornment for Autos, and that Options has two adornments, bu and uc[autoTrans, navi].
 Query: find the serial numbers and colors of Gobi models with a navigation system.
 Here are three plans:
1. Specify that the model is Gobi, query Autos, and get the serial numbers and colors of all Gobis. Then, using the bu adornment for Options, find the options for each such serial number and filter to make sure the car has a navigation system.
2. Specify the navigation-system option, query Options using the uc[autoTrans, navi] adornment, and get all the serial numbers of cars with a navigation system.
3. As in plan 2, get the serial numbers of cars with a navigation system; then use these serial numbers to query Autos and see which of those cars are Gobis.
21.4.4 Adding Cost-Based
Optimization
573
 The mediator's query optimizer is not done when the capabilities of the sources have been examined.
 Having found the feasible plans, it must choose among them.
 Making an intelligent, cost-based query optimization requires that the mediator know a great deal about the costs of the queries involved.
 The sources are independent of the mediator, so it is difficult to estimate those costs.
21.5 Optimizing Mediator
Queries
574
 Chain algorithm: a greedy algorithm that finds a way to answer the query by sending a sequence of requests to its sources.
 It will always find a solution, assuming at least one solution exists.
 The solution may not be optimal.
 The class of queries it handles involves joins of relations that come from the sources, followed by an optional selection and an optional projection onto the output attributes.
 Such queries can be expressed as Datalog rules.
21.5.1 Simplified Adornment
Notation
575
 The chain algorithm works with Datalog rules, together with a record of whether prior source requests have provided bindings for any of the variables in the body of the rule.
 A query at the mediator is limited to b (bound) and f (free) adornments.
 We use the following convention for describing adornments: the adornment is written as a superscript on the relation name, as in R^bf(w, x), with one adornment code per attribute.
21.5.2 Obtaining Answers for
Subgoals
576
 Rules for subgoals and sources. Suppose we have the subgoal R^x1x2...xn(a1, a2, ..., an), where each xi is either b or f, and a source adornment for R is y1y2...yn, where each yi can be b, f, u, c[S], or o[S] for some set S. Then it is possible to obtain a relation for the subgoal from that source provided the adornment on the subgoal matches the adornment at the source, i.e., for each i = 1, 2, ..., n:
 if yi is b or c[S], then xi = b;
 if yi is f, u, or o[S], then xi may be either b or f;
 if xi = f, then yi must not be output-restricted (primed).
21.5.3 The Chain Algorithm
577
 The chain algorithm is a greedy approach to selecting an order in which to obtain relations for the subgoals of a Datalog rule. It does not always provide an efficient solution, but it always provides a solution if one exists.
 It maintains two kinds of information:
 an adornment for each subgoal;
 a relation X that is the join of the relations for all the subgoals that have been resolved.
 Initially, the adornment for a subgoal has b exactly where the mediator query provides a constant binding for the corresponding argument of that subgoal.
21.5.3 The Chain Algorithm
(con’t)
578
First, initialize the adornments of the subgoals and X. Then, repeatedly select a subgoal that can be resolved. Let R^α(a1, a2, ..., an) be that subgoal.
1. Wherever α has a b, the corresponding argument of R is either a constant or a variable that already has bindings in X. Project X onto those of its variables that appear in R.
21.5.3 The Chain Algorithm
(con’t)
579
2. For each tuple t in the projection of X, issue a query to the source, as follows (β is the source adornment being used):
 if a component of β is b, then the corresponding component of α is b, and we use the corresponding component of t in the source query;
 if a component of β is c[S], and the corresponding component of t is in S, then the corresponding component of α is b, and we use the corresponding component of t in the source query;
 if a component of β is f and the corresponding component of α is b, we may provide the constant value from t in the source query;
21.5.3 The Chain Algorithm
(con’t)
580
 if a component of β is u, we provide no binding for this component in the source query;
 if a component of β is o[S] and the corresponding component of α is f, we treat it as if it were f;
 if a component of β is o[S] and the corresponding component of α is b, we treat it as if it were c[S].
3. Every variable among a1, a2, ..., an is now bound. For each remaining unresolved subgoal, change its adornment so that any position holding one of these variables is b.
21.5.3 The Chain Algorithm
(con’t)
581
4. Replace X with X ⋈ πS(R), where S is the set of all variables among a1, a2, ..., an. Then project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal.
5. If every subgoal is resolved, then X is the answer. If some subgoal cannot be resolved, the algorithm fails.
21.5.3 The Chain Algorithm Example
582
 Mediator query:
Q: Answer(c) ← R^bf(1, a) AND S^ff(a, b) AND T^ff(b, c)
 Example data and source adornments:
R(w, x) = {(1, 2), (1, 3), (1, 4)}, source adornment bf
S(x, y) = {(2, 4), (3, 5)}, source adornment c'[2,3,5]f
T(y, z) = {(4, 6), (5, 7), (5, 8)}, source adornment bu
21.5.3 The Chain Algorithm Example
(con’t)
583
 Initially, the adornments on the subgoals are the same as in Q, and X contains an empty tuple.
 S and T cannot be resolved, because they each have ff adornments while their sources require either a b or a c.
 R(1, a) can be resolved, because its adornment is matched by the source's adornment.
 Send the query R(w, x) with w = 1 to get the table shown on the previous slide.
21.5.3 The Chain Algorithm Example
(con’t)
584
 Project the subgoal's relation onto its second component, since only the second component of R(1, a) is a variable:
a = {2, 3, 4}
 This is joined with X, resulting in X equaling this relation.
 Change the adornment on S from ff to bf.
21.5.3 The Chain Algorithm Example
(con’t)
585
 Now we resolve S^bf(a, b):
 Project X onto a (leaving X as it is).
 Search S for tuples whose first component equals a value of a in X:
(a, b) = {(2, 4), (3, 5)}
 Join this relation with X, and remove a, because it appears neither in the head nor in any unresolved subgoal:
b = {4, 5}
21.5.3 The Chain Algorithm Example
(con’t)
586
 Now we resolve T^bf(b, c):
(b, c) = {(4, 6), (5, 7), (5, 8)}
 Join this relation with X and project onto the c attribute to get the relation for the head.
 The solution is {(6), (7), (8)}.
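The worked example can be reproduced in a few lines of Python (data from the slides; the chained comprehensions are an illustrative stand-in for the successive source queries):

```python
# Data from the slides: R(w, x), S(x, y), T(y, z).
R = {(1, 2), (1, 3), (1, 4)}
S = {(2, 4), (3, 5)}
T = {(4, 6), (5, 7), (5, 8)}

# Resolve R(1, a): the source adornment bf lets us bind w = 1.
a_vals = {x for (w, x) in R if w == 1}          # binds a
# Resolve S(a, b): query S once per bound value of a.
b_vals = {y for (x, y) in S if x in a_vals}     # binds b
# Resolve T(b, c): query T once per bound value of b.
answer = {z for (y, z) in T if y in b_vals}
```

Each comprehension plays the role of step 2 of the algorithm for one subgoal, and the final set is the projection onto c.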
21.5.4 Incorporating Union Views at
the Mediator
587
 This version of the chain algorithm does not consider that several sources may contribute tuples to the same relation.
 If some sources have tuples to contribute that other sources do not, this adds complexity.
 To resolve this, we can either consult all sources, or make a best effort to return all the answers.
21.5.4 Incorporating Union Views at
the Mediator (con’t)
588
 Consulting all sources:
 We can only resolve a subgoal when every source for its relation has an adornment matched by the current adornment of the subgoal.
 Less practical, because it makes queries harder to answer, and impossible if any source is down.
 Best effort:
 We need only one source with a matching adornment to resolve a subgoal.
 The chain algorithm must be modified to revisit each subgoal when that subgoal acquires new bound arguments.
590
21.6 LOCAL-AS-VIEW
MEDIATORS
Local-as-View Mediators.
591
 In a LAV mediator, we define global predicates at the mediator, but they are not views of the source data.
 Instead, for each source we define expressions, involving the global predicates, that describe the tuples the source is able to produce.
 Queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.
Motivation for LAV Mediators
592
 The relationship between what the mediator should provide and what the sources provide is more subtle than in GAV mediation.
 For example, consider the predicate Par(c, p), meaning that p is a parent of c, which represents the set of all child-parent facts that could ever exist.
 The sources will provide information about whatever child-parent facts they know.
Motivation(contd..)
593
 There can be sources that provide child-grandparent facts but no child-parent facts at all.
 Such a source could never be used to answer a child-parent query under a GAV mediator.
 LAV mediators allow us to say that a certain source provides grandparent facts.
 They help discover how and when to use that source in a given query.
Terminology for LAV Mediation.
594
 Both the queries at the mediator and the queries that describe the sources are single Datalog rules.
 A single Datalog rule is often called a conjunctive query.
 The global predicates of the LAV mediator are used as the subgoals of mediator queries.
 Conjunctive queries also define views.
Contd..
595
 The head of each view definition has a unique view predicate, which is the name of the view.
 Each view definition has a body consisting of global predicates, and is associated with a particular source.
 Assumption: each view can be constructed with an all-free adornment.
Example..
596
 Consider the global predicate Par(c, p), meaning that p is a parent of c.
 One source produces parent facts. Its view is defined by the conjunctive query:
V1(c, p) ← Par(c, p)
 Another source produces some grandparent facts. Its conjunctive query is:
V2(c, g) ← Par(c, p) AND Par(p, g)
Example contd..
597
The query at the mediator asks for the great-grandparent facts that can be obtained from the sources. The mediator query is:
Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)
 One solution uses the parent predicate (V1) directly, three times:
Q(w, z) ← V1(w, x) AND V1(x, y) AND V1(y, z)
Example contd..
598
 Another solution uses V1 (parent facts) and V2 (grandparent facts):
Q(w, z) ← V1(w, x) AND V2(x, z)
or
Q(w, z) ← V2(w, y) AND V1(y, z)
Expanding Solutions.
599
 Consider a query Q and a solution S whose body's subgoals are views, where each view V is defined by a conjunctive query with that view as the head.
 The body of V's conjunctive query can be substituted for a subgoal in S that uses the predicate V, so that the resulting body, called the expansion of S, consists of only global predicates.
Expansion Algorithm
600
 A solution S has a subgoal V(a1, a2, ..., an), where the ai's can be any variables or constants.
 The view V is of the form V(b1, b2, ..., bn) ← B, where B represents the entire body.
 V(a1, a2, ..., an) can be replaced in solution S by a version of body B that has all the subgoals of B, with variables possibly altered.
Expansion Algorithm contd..
601
The rules for altering the variables of B are:
1. First identify the local variables of B: the variables that appear in the body but not in the head.
2. If any local variable of B also appears in S (or in another substituted body), replace it by a distinct new variable that appears nowhere in the rule for V or in S.
3. In the body B, replace each bi by ai, for i = 1, 2, ..., n.
Example.
 Consider the view definitions:
V1(c, p) ← Par(c, p)
V2(c, g) ← Par(c, p) AND Par(p, g)
 One of the proposed solutions S is:
Q(w, z) ← V1(w, x) AND V2(x, z)
 The first subgoal, with predicate V1, expands to Par(w, x), since V1 has no local variables.
Example Contd.
603
 The V2 subgoal has a local variable p, which does not appear in S and has not been used as a local variable in another substitution, so p can be left as it is.
 Only x and z are substituted for the variables c and g.
 The solution S now becomes:
Q(w, z) ← Par(w, x) AND Par(x, p) AND Par(p, z)
Containment of Conjunctive
Queries
604
 A containment mapping from Q to E is a function τ from the variables of Q to the variables and constants of E, such that:
1. If x is the ith argument of the head of Q, then τ(x) is the ith argument of the head of E.
2. If P(x1, x2, ..., xn) is a subgoal of Q, then P(τ(x1), τ(x2), ..., τ(xn)) is a subgoal of E. (We add to τ the rule that τ(c) = c for any constant c.)
Example.
605
 Consider two conjunctive queries:
Q1: H(x, y) ← A(x, z) AND B(z, y)
Q2: H(a, b) ← A(a, c) AND B(d, b) AND A(a, d)
 When we apply the substitution τ(x) = a, τ(y) = b, τ(z) = d, the head of Q1 becomes H(a, b), which is the head of Q2. So there is a containment mapping from Q1 to Q2.
Example contd..
606
 The first subgoal of Q1 becomes A(a, d), which is the third subgoal of Q2.
 The second subgoal of Q1 becomes the second subgoal of Q2.
 There is also a containment mapping from Q2 to Q1, so the two conjunctive queries are equivalent.
Why the Containment-Mapping
Test Works
607
 Suppose there is a containment mapping τ from Q1 to Q2.
 When Q2 is applied to the database, we look for substitutions σ for all the variables of Q2 that make all its subgoals true.
 The substitution applied to the head becomes a tuple t that is returned by Q2.
 If we compose τ and then σ, we have a mapping from the variables of Q1 to tuples of the database that produces the same tuple t for the head of Q1; hence every tuple returned by Q2 is also returned by Q1.
608
Finding Solutions to a Mediator
Query
 There can be an infinite number of solutions built from the views, using any number of subgoals and variables.
 The LMSS theorem limits the search: if a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals.
 Moreover, if the conjunctive query that defines a view V has in its body a predicate P that does not appear in the body of the mediator query, then we need not consider any solution that uses V.
Example.
609
 Recall the query
Q1: Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)
 This query has three subgoals, so we do not have to look at solutions with more than three subgoals.
Why the LMSS Theorem Holds
610
 Suppose we have a query Q with n subgoals, and there is a solution S with more than n subgoals.
 The expansion E of S must be contained in the query Q, which means that there is a containment mapping from Q to E.
 We remove from S all subgoals whose expansions are not the targets of any of Q's subgoals under the containment mapping.
Contd..
611
 We are left with a new conjunctive query S' with at most n subgoals.
 If E' is the expansion of S', then E' is contained in Q.
 S is contained in S', as there is an identity mapping from S' into S.
 Thus S need not be among the solutions considered for query Q.
INFORMATION INTEGRATION
ENTITY RESOLUTION – 21.7
Contents

21.7 Entity Resolution
 21.7.1 Deciding Whether Records
Represent a Common Entity
 21.7.2 Merging Similar Records
 21.7.3 Useful Properties of Similarity and
Merge Functions
 21.7.4 The R-Swoosh Algorithm for ICAR
Records
 21.7.5 Other Approaches to Entity
Resolution
Introduction
 Entity resolution is the problem of determining whether two records or tuples do or do not represent the same person, organization, place, or other entity.
Deciding whether Records
represent a Common Entity
 Two records represent the same individual if the two records have similar values for each of the fields associated with those records.
 Two records may represent the same entity even though some corresponding fields are not identical, for reasons such as:
1. misspellings
2. variant names
3. misunderstanding of names
Continue: Deciding whether
Records represent a Common Entity
4. evolution of values
5. abbreviations
Thus, when deciding whether two records represent the same entity, we need to look carefully at the kinds of discrepancies that occur and devise a test that measures the similarity of records.
Deciding Whether Records
Represents a Common Entity Edit Distance
 A first approach to measuring the similarity of records is edit distance.
 Values that are strings can be compared by counting the number of insertions and deletions of characters it takes to turn one string into the other.
 The records represent the same entity if their similarity measure is below a given threshold.
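The insertion/deletion edit distance described above can be sketched as a standard dynamic program (with this metric, a substitution costs one deletion plus one insertion):

```python
# Sketch of the insertion/deletion edit distance described in the text.

def edit_distance(a, b):
    m, n = len(a), len(b)
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],   # delete a[i-1]
                                  d[i][j - 1])   # insert b[j-1]
    return d[m][n]
```

For example, turning "Smith" into "Smythe" takes three insert/delete steps.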
Deciding Whether Records Represents a
Common Entity - Normalization
 We can normalize records by replacing certain substrings by others. For instance, we can use a table of abbreviations and replace each abbreviation by what it normally stands for.
 Once the records are normalized, we can use edit distance to measure the difference between the normalized values in the fields.
Merging Similar Records
 Merging means replacing two similar records with a single record that contains the information of both.
 There are many possible merge rules, for example:
1. Set any field on which the records disagree to the empty string.
2. (i) Merge by taking the union of the values in each field; (ii) declare two records similar if at least two of the three fields have a nonempty intersection.
Continue: Merging Similar
Records
Name | Address | Phone
1. Susan | 123 Oak St. | 818-555-1234
2. Susan | 456 Maple St. | 818-555-1234
3. Susan | 456 Maple St. | 213-555-5678
After merging:
Name | Address | Phone
(1-2-3) Susan | {123 Oak St., 456 Maple St.} | {818-555-1234, 213-555-5678}
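Merge rule 2 can be sketched with records whose fields are sets (data from the table above; the dictionary layout is illustrative):

```python
# Sketch of merge rule 2: union in each field; two records are "similar"
# when at least two of the three fields have a nonempty intersection.

def similar(r, s):
    return sum(bool(r[f] & s[f]) for f in r) >= 2

def merge(r, s):
    return {f: r[f] | s[f] for f in r}

r1 = {"name": {"Susan"}, "addr": {"123 Oak St."}, "phone": {"818-555-1234"}}
r2 = {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"818-555-1234"}}
r3 = {"name": {"Susan"}, "addr": {"456 Maple St."}, "phone": {"213-555-5678"}}
merged = merge(merge(r1, r2), r3)
```

Here r1 and r2 are similar (name and phone intersect), and the final merged record is the (1-2-3) row of the table.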
Properties of Similarity and
Merge Functions
The following properties say that the merge
operation is a semilattice:
1. Idempotence: The merge of a record with
itself should surely be that record.
2. Commutativity: The order of merging should
not matter, and the end result should be the
same.
3. Associativity: The order in which we group
records for a merger should not matter.
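The three semilattice laws can be checked directly for a union-style merge; this is only a sketch with toy records, where records are dicts of value sets.

```python
# Records as dicts mapping field -> set of values; merge is field-wise union.
def merge(r, s):
    return {f: r[f] | s[f] for f in r}

r = {"name": {"Susan"}, "phone": {"818-555-1234"}}
s = {"name": {"Susan"}, "phone": {"213-555-5678"}}
t = {"name": {"Sue"},   "phone": {"818-555-1234"}}

assert merge(r, r) == r                                # idempotence
assert merge(r, s) == merge(s, r)                      # commutativity
assert merge(merge(r, s), t) == merge(r, merge(s, t))  # associativity
```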
Continue: Useful Properties of
Similarity and Merge Functions
There are some other properties that we expect
the similarity relationship to have:

Idempotence for similarity: A record is
always similar to itself.

Commutativity of similarity: In deciding
whether two records are similar, it does not
matter in which order we list them.

Representability: If r is similar to record s,
and s is merged with some other record t,
then r remains similar to the merger of s and
t and can be merged with that record.
R-swoosh Algorithm for ICAR Records

Input: a set of records I, a similarity function,
and a merge function.
Output: a set of merged records O.
Method:

O := emptyset;
WHILE I is not empty DO BEGIN
    Let r be any record in I;
    Find, if possible, some record s in O that is similar to r;
    IF no record s exists THEN
        move r from I to O
    ELSE BEGIN
        delete r from I;
        delete s from O;
        add the merger of r and s to I;
    END;
END;
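The pseudocode above can be turned into a short runnable sketch. The similarity and merge functions below (two-of-three field intersection, field-wise union) are example choices from the earlier slides, not fixed by the algorithm itself.

```python
def similar(r, s):
    # Example rule: similar when at least two fields share a value.
    return sum(1 for f in r if r[f] & s[f]) >= 2

def merge(r, s):
    # Example rule: field-wise union of the value sets.
    return {f: r[f] | s[f] for f in r}

def r_swoosh(records):
    I = list(records)              # input set I (copied; caller's list survives)
    O = []                         # output set O
    while I:
        r = I.pop()                # let r be any record in I
        s = next((x for x in O if similar(r, x)), None)
        if s is None:
            O.append(r)            # no similar record in O: move r to O
        else:
            O.remove(s)            # otherwise delete s from O and put the
            I.append(merge(r, s))  # merger of r and s back into I
    return O

records = [
    {"name": {"Susan"}, "address": {"123 Oak St."},   "phone": {"818-555-1234"}},
    {"name": {"Susan"}, "address": {"456 Maple St."}, "phone": {"818-555-1234"}},
    {"name": {"Susan"}, "address": {"456 Maple St."}, "phone": {"213-555-5678"}},
]
resolved = r_swoosh(records)   # the three Susan records collapse into one
```

Putting the merger back into I (rather than straight into O) is what lets it pick up further matches, as representability guarantees it can.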
Other Approaches to Entity
Resolution
The other approaches to entity resolution are:
 Non-ICAR Datasets
 Clustering
 Partitioning
Other Approaches to Entity
Resolution - Non ICAR Datasets
Non-ICAR Datasets: We can define a dominance
relation r <= s that means record s contains all
the information contained in record r.
If so, then we can eliminate record r from
further consideration.
Other Approaches to Entity
Resolution - Clustering
Clustering: Sometimes we group the records into
clusters such that members of a cluster are in
some sense similar to each other and members
of different clusters are not similar.
Other Approaches to Entity
Resolution - Partitioning
Partitioning: We can group the records, perhaps
several times, into groups that are likely to
contain similar records and look only within each
group for pairs of similar records.
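A minimal partitioning (often called blocking) sketch along these lines; the grouping key, the first letter of the name, is a hypothetical choice for illustration.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key=lambda r: r["name"][0].lower()):
    # Group records by a cheap key, then compare pairs only within a group.
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    for group in groups.values():
        yield from combinations(group, 2)

people = [{"name": "Susan"}, {"name": "susan"}, {"name": "Bob"}]
pairs = list(candidate_pairs(people))   # only the two Susans are compared
```

This replaces the quadratic all-pairs comparison with comparisons inside each group, at the risk of missing similar records that land in different groups.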