What is a DBMS?

Download Report

Transcript What is a DBMS?

B-Trees
Size and Lookup
B-Trees (I)
Suppose:
• Blocks can hold either:
- 10 records or
- 99 keys and 100 pointers
• B-Tree nodes are 70% full
- 69 keys and 70 pointers
• 1,000,000 records
A. Records sorted on search key, with
10 records per blocks. The B-tree is
a dense index.
• 1,000,000 /10 = 100,000 blocks to
hold the records
• For the B-tree we reason as follows:
–
–
• For each structure described
below, determine
- The total number of blocks
- The average # of I/O for
lookup given the search key.
–
•
•
We need 1,000,000 pointers at the
leaves to point to each record.

They can be packed into
1,000,000 / 70 = 14,286 leave
blocks.
We need 14,286 pointers in the
above level.

Packed into 14,286/70=204
blks
We need 204 pointers in the above

Packed into 204/70 = 3 blocks
Total: 100,000 + 14,286 + 204 + 3
1(root) = 114,494 blocks
Since the tree has 4 levels, we need
5 I/O’s for a lookup.
B-Trees (II)
Suppose:
• Blocks can hold either:
- 10 records or
- 99 keys and 100 pointers
• B-Tree nodes are 70% full
- 69 keys and 70 pointers
• 1,000,000 records
• For each structure described
below, determine
- The total number of blocks
- The average # of I/O for
lookup given the search key.
B Same as (A), but the records aren’t
sorted. Still, records packed 10 to a
block.
• Same as (A) = 114,494 blocks
• Same as (A) = 5 I/O’s for a lookup.
B-Trees (III)
Suppose:
• Blocks can hold either:
- 10 records or
- 99 keys and 100 pointers
• B-Tree nodes are 70% full
- 69 keys and 70 pointers
• 1,000,000 records
C Same as (A), but B-tree is a sparse
index.
• 1,000,000 /10 = 100,000 blocks to
hold the records
• For the sparse B-tree:
– We need 100,000 pointers at the
leaves to point to each data block.
 They can be packed into 100,000
/ 70 = 1,429 leave blocks.
– We need 1,429 pointers in the above
level.
 Packed into 1429/70=21 blocks
– We need 21 pointers in the above
 Packed into 21/70 = 1 block
(root)
• For each structure described
below, determine
- The total number of blocks
• Total: 100,000 + 1,429 + 21 + 1 =
- The average # of I/O for
101,451blocks
lookup given the search key.
• Since the tree has 3 levels, we need
4 I/O’s for a lookup.
D Instead of the B-Tree leaves having
pointers to data records, the B-Tree
leaves hold the records themselves. A
block can hold 10 records, but a leaf
Suppose:
block is in fact 70% full, i.e. there are 7
• Blocks can hold either:
records per leaf.
- 10 records or
• 1,000,000 /7 = 142,857 blocks to hold the
records. These blocks will be the leaves
- 99 keys and 100 pointers
of the B-tree.
• B-Tree nodes are 70% full
• We need 142,857 pointers at the next
- 69 keys and 70 pointers
level.
• 1,000,000 records
– They can be packed into 142,857/ 70
= 2,040 blocks.
• We need 2,040 pointers in the above
• For each structure described
next level.
below, determine
– Packed into 2040/70=30 blocks
- The total number of blocks
• We need 30 pointers in the above…
- The average # of I/O for
– Packed into 30/70 = 1 block (root)
lookup given the search key.
• Total: 142857 + 2040 +30 +1 = 144,928
blocks
• Since the tree has 4 levels, we need 4
I/O’s for a lookup.
B-Trees (IV)
B-Trees
Range Queries:
Repeat the exercise in the case that the
query is a range query that is matched by
1000 records.
B-Trees (I)
Suppose:
• Blocks can hold either:
- 10 records or
- 99 keys and 100 pointers
• B-Tree nodes are 70% full
- 69 keys and 70 pointers
• 1,000,000 records
• For each structure described
below, determine
- The total number of blocks
- The average # of I/O for
lookup given the search key.
A. Records sorted on search key, with
10 records per blocks. The B-tree is
a dense index.
• Since the tree has 4 levels, we need
4 I/O’s to go to the leaf where the
pointer to start of the range is
located.
• Then, by following the sibling
pointers we retrieve all the leaves
holding the pointers to the matching
records.
• 1000 pointers are packed in
1000/70 = 14 blocks.
• Now, by following each of the 1000
pointers we read the 1000 records.
–
•
Since the records are sorted the
1000 range records will occupy
(almost) as few blocks as possible,
i.e. 1000 / 10 = 100 blocks.
In total, we need 4 + 14 + 100 = 118
I/O’s.
B-Trees (II)
Suppose:
• Blocks can hold either:
- 10 records or
- 99 keys and 100 pointers
• B-Tree nodes are 70% full
- 69 keys and 70 pointers
• 1,000,000 records
B Same as (A), but the records aren’t
sorted. Still, records packed 10 to a
block.
• As in (A) we need 4+14 I/O’s to
locate the 1000 pointers.
• However, since the records aren’t
sorted, it might be that the 1000
records are located in 1000 different
blocks. So, we might end up reading
1000 blocks.
• For each structure described
below, determine
• In total, we need 4 + 14 + 1000 =
- The total number of blocks
1018 I/O’s.
- The average # of I/O for
lookup given the search key.
B-Trees (III)
Suppose:
• Blocks can hold either:
- 10 records or
- 99 keys and 100 pointers
• B-Tree nodes are 70% full
- 69 keys and 70 pointers
• 1,000,000 records
• For each structure described
below, determine
- The total number of blocks
- The average # of I/O for
lookup given the search key.
C Same as (A), but B-tree is a sparse
index.
• Since the tree has 3 levels, we need
3 I/O’s to go to the leaf where the
pointer to start of the range is
located.
• How many pointers to data we need
to follow?
• 1000 records are packed into
1000/10 = 100 blocks. So, we need
to follow 100 pointers.
• How many leaves are needed to
pack 100 pointers?
• 2 leaves.
• Total: 3 + 2 + 100 = 105 I/O’s
B-Trees (IV)
Suppose:
• Blocks can hold either:
- 10 records or
- 99 keys and 100 pointers
• B-Tree nodes are 70% full
- 69 keys and 70 pointers
• 1,000,000 records
D Instead of the B-Tree leaves having
pointers to data records, the B-Tree
leaves hold the records themselves. A
block can hold 10 records, but a leaf
block is in fact 70% full, i.e. there are 7
records per leaf.
•
Since the tree has 4 levels, we need 4
I/O’s to go to the leaf where the start
record of the range is located.
•
• For each structure described
below, determine
- The total number of blocks
- The average # of I/O for
lookup given the search key. •
By following the sibling pointers we need
to retrieve as many leaves as are needed
to hold 1000 records.
– 1000 records are packed in 1000 /7 =
143 leaves.
Total: 4 + 143 = 147 I/O’s
Hash Tables
Extensible Hash Tables
Dynamic Hashing Framework
• Hash function h produces a sequence of k bits.
• Only some of the bits are used at any time to determine
placement of keys in buckets.
Extensible Hashing (Buckets may share blocks!)
• Keep parameter i = number of bits from the beginning of h(K)
that determine the bucket.
• Bucket array now = pointers to buckets.
- A block can serve as several buckets.
- For each block, a parameter ji tells how many bits of h(K)
determine membership in the block.
- I.e., a block represents 2i-j buckets that share the first j
bits of their number.
Example
• An extensible hash table when i=1:
Extensible Hashtable Insert
•
•
If record with key K fits in the block pointed to by h(K), put it
there.
If not, let this block B represent j bits.
1. j<i:
a) Split block B into two and distribute the records (of B)
according to (j+1)st bit;
b) set j:=j+1;
c) fix pointers in bucket array, so that entries that formerly
pointed to B now point either to B or the new block
How?
depending on…(j+1)st bit
2. j=i:
Let w be an old array entry.
Both the new entries w0 and w1
point to the same block that w used
to point to.
1. Set i:=i+1;
2. Double the bucket array,
so it has now 2i+1 entries;
3. proceed as in (1).
Example
• Insert record with h(K) = 1010.
Now, after the
insertion
Before
Example: Next
Currently
• Next: records with
h(K)=0000; h(K)=0111.
- Bucket for 0... gets split,
- but i stays at 2.
• Then: record with h(K) = 1000.
- Overflows bucket for 10...
- Raise i to 3.
After the
insertions
Exercise
• Suppose we want to insert keys with hash values:
0000…1111 in an extensible hash table.
• Assume that a block can hold three records.
i=1
i=1
0000
0
1
1
0
1
1
0000
0001
0010
1
1
i=1
0
1
0000
0001
1
1
Insertion of 0011.
No room
i=2
00
01
10
11
i=3
0000
0001
0010
2
2
1
This is
the new
block.
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
This is
the new
block.
2
1
i=3
000
001
010
011
100
101
110
111
i=3
0000
0001
3
0010
0011
3
2
1
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
2
1
i=3
000
001
010
011
100
101
110
111
i=3
0000
0001
3
0010
0011
3
0100
0101
2
1
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
0110
2
1
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
1
0110
0111
3
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
1000
1
0110
0111
3
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
1000
1001
1
0110
0111
3
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
1000
1001
1010
1
0110
0111
3
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
1000
1001
1010
2
0110
0111
3
2
Still no room for
1011
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
0110
0111
3
1000
1001
3
1010
1011
3
2
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
0110
0111
3
1000
1001
3
1010
1011
3
1100
2
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
0110
0111
3
1000
1001
3
1010
1011
3
1100
1101
2
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
0110
0111
3
1000
1001
3
1010
1011
3
1100
1101
1110
2
i=3
000
001
010
011
100
101
110
111
0000
0001
3
0010
0011
3
0100
0101
3
0110
0111
3
1000
1001
3
1010
1011
3
1100
1101
3
1110
1111
3
Hash Tables
Linear Hash Tables
Linear Hashing
•
•
•
Use i bits from right (loworder) end of h(K).
Buckets numbered [0…n-1], where 2i-1<n2i.
Let last i bits of h(K) be m = (a1,a2,…,ai)
1. If m < n, then record belongs in bucket m.
2. If nm<2i, then record belongs in bucket m-2i-1, that is
the bucket we would get if we changed a1 (which must
be 1) to 0.
i=1
#of buckets
n=2
#of records
r=3
This is also part
of the structure
Linear HashTable Insert
• Pick an upper limit on capacity,
- e.g., 85% (1.7 records/bucket in our example).
• If an insertion exceeds capacity limit, set n := n + 1.
- If new n is 2i + 1, set i := i + 1. No change in bucket
numbers needed --- just imagine a leading 0.
- Need to split bucket n - 2i-1 because there is now a
bucket numbered (old) n.
Example
• Insert record with h(K) = 0101.
- Capacity limit exceeded; increment n.
i=1
#of buckets
n=2
#of records
r=3
i=2
#of buckets
n=3
#of records
r=4
Example
• Insert record with h(K) = 0001.
- Capacity limit not exceeded.
- But bucket is full; add overflow bucket.
i=2
n=3
r=5
Example
• Insert record with h(K) = 1100.
- Capacity exceeded; set n = 4, add bucket 11.
- Split bucket 01.
i=2
n=4
r=7
Lookup in Linear Hash Table
• For record(s) with search key K, compute h(K); search the
corresponding bucket according to the procedure described
for insertion.
• If the record we wish to look up isn’t there, it can’t be
anywhere else.
• E.g. lookup for a key which hashes to 1010, and then for a
key which hashes to 1011.
i=2
n=3
r=4
Exercise
• Suppose we want to insert keys with hash values:
0000…1111 in a linear hash table with 100% capacity
threshold.
• Assume that a block can hold three records.
i=1
n=1
0000
0
n=1
r=1
r=3
0
0000
0001
0
0000
0010
1
0001
0011
r=2
i=1
n=1
i=1
0
0000
0001
0010
i=1
n=2
r=4
i=1
n=2
0
r=5
1
0000
0010
0100
0001
0011
i=1
n=2
0
0000
0010
0100
1
0001
0011
0101
r=6
i=2
n=3
0000
00
0100
r=7
01
10
0001
0011
0101
0010
0110
Continue at home…
Multidimensional Indexes
Grid files (hash-like structure)
• Divide data into stripes in
each dimension
• Rectangle in grid points
to bucket
• Example: database
records (age,salary) for
people who buy gold
jewelry.
Data:
(25,60) (45,60) (50,75) (50,100)
(50,120) (70,110) (85,140) (30,260)
(25,400) (45,350) (50,275) (60,260)
Grid file
Operations
Lookup
Find coordinates of point in each dimension --gives you a bucket to search.
Nearest Neighbor
Lookup point P . Consider points in that bucket.
• Problem: there could be points in adjacent
buckets that are closer.
- Example: NN of (45; 200).
• Problem: there could be no points at all in the
bucket: widen search?
Range Queries
Ranges define a region of buckets.
• Buckets on border may contain points not in
range.
• Example: 35 < age <= 45; 50 < salary <=
100.
Queries Specifying Only One Attribute
• Problem: must search a whole row or column
of buckets.
Insertion
• Use overflow buckets,
or split stripes in one or
more dimensions
• Insert (52,200). Split
central bucket, for
instance by splitting
central salary stripe
• The blocks of 3 buckets
are to be processed.
• In general the blocks of
n buckets are to be
processed during a
split.
• n is the number of
buckets in the
chosen direction
• Very expensive.
Partitioned hashing
• Example: Gold jewelry
with
• first bit = age mod 2
• bits 2 and 3: salary
mod 4
• Works well for:
• partial match (i.e. just
an attribute specified)
• Bad for:
• range
• Nearest Neighbors
queries
KD-Trees
• Generalizes binary
search trees, but
search attributes rotate
among dimensions
• Levels rotate among
the dimensions,
partitioning the points
by comparison with a
value for that
dimension.
• Leaves are blocks
Geometrically…
• Remember we didn’t
want the stripes in grid
files to continue all along
the vertical or horizontal
direction?
• Here they don’t.
Operations
Lookup in KDTrees
• Find appropriate leaf by binary search. Is the record there?
Insert Into KDTrees
• Lookup record to be inserted, reaching the appropriate leaf.
• If there is room, put record in that block.
• If not, find a suitable value for the appropriate dimension and
split the leaf block.
Example
• Someone 35 years old with a salary of $500K buys gold
jewelry.
• Belongs in leaf with (25; 400) and (45; 350).
• Too full: split on age. See figure next.
Split at 35 is because
it is the median.
Queries
Partial match queries
• When we don’t know
the value of the
attribute at the node,
we must explore both
of its children.
- E.g. find points with
age=50
Range Queries
• Sometimes a range will
allow us to move to
only one child of a
node.
• But if the range
straddles the splitting
value then we must
explore both children.
R-Tree Lookup (Where am I)
• We start at the root, with which the
entire region is associated.
• We examine the subregions at the
root and determine which children
correspond to interior regions that
contain point P.
• If there are zero regions we are
done; P is not in any data region.
• If there are some subregions we
must recursively search those
children as well, until we reach the
leaves of the tree.
R-Tree Insertion
• We start at the root and try to find some subregion into R fits.
If more than one we pick just one, and repeat the process
there.
• If there is no region, we expand, and we want to expand as
little as possible. So, we pick the child that will be expanded
as little as possible.
• Eventually we reach a leaf, where we insert the region R.
• However, if there is no room we have to split the leaf. We
split the leaf in such a way as to have the smallest
subregions.
Example
• Suppose that the
leaves have room for
six regions.
• Further suppose that
the six regions are
together on one leaf,
whose region is
represented by the
outer solid rectangle.
• Now suppose that
another region POP
is added.
Example (Cont’ ed)
((0,0),(60,50))
Road1 Road2 House1
((20,20),(100,80))
School House2 Pipeline Pop
Example (Cont’ ed)
• Suppose now that
House3 ((70,5),(80,15))
gets added.
• We do have space to the
leaves, but we need to
expand one of the
regions at the parent.
• We choose to expand
the one which needs to
be expanded the least.
Which one should we expand?
((0,0),(80,50))
((20,20),(100,80))
Road1 Road2 House1 House3
((0,0),(60,50))
Road1 Road2 House1
School House2 Pipeline Pop
((5,20),(100,80))
School House2 Pipeline Pop House3
Bitmap Indexes
• Suppose we have n tuples.
• A bitmap index for a field F is a collection of bit vectors of
length n, one for each possible value that may appear in the
field F.
• The vector for value v has 1 in position i if the i-th record has
v in field F, and it has 0 there if not.
(30, foo)
(30, bar)
(40, baz)
(50, foo)
(40, bar)
(30, baz)
foo 100100
bar 0…
baz …
Motivation for Bitmap Indexes
• They allow very fast evaluation of partial match queries.
SELECT title
FROM Movie
WHERE studioName=‘Disney’ AND year=1995;
If there are bitmap indexes on both studioName and year, we
can intersect the vectors for the Disney value and 1995
value.
We should have another index to retrieve the tuples by number.