Transcript Lecture 10

Indexing
1
Motivation
Sells( bar,
Joe’s
Joe’s
Sue’s
Sue’s
beer,
Bud
Miller
Bud
Coors
price )
2.50
2.75
2.50
3.00
Bars( bar, addr
)
Joe’s Maple St.
Sue’s River Rd.
Query: Find all locations that sell beer for 2.75.
Select addr
From Sells, Bars
Where (Sells.bar = Bars.bar) and (Sells.price = 2.75)
2
Sells(
bar,
Joe’s
Joe’s
Sue’s
Sue’s
beer,
Bud
Miller
Bud
Coors
price )
2.50
2.75
2.50
3.00
Bars(bar,
Joe’s
Sue’s
addr
)
Maple St.
River Rd.
PROJECTBars.addr
SELECTSells.price = 2.75
JOIN
Sells
Sells.bar = Bars.bar
Bars
3
Sells(
bar,
Joe’s
Joe’s
Sue’s
Sue’s
beer,
Bud
Miller
Bud
Coors
price )
2.50
2.75
2.50
3.00
Bars(bar,
Joe’s
Sue’s
addr
)
Maple St.
River Rd.
PROJECTBars.addr
JOIN
Sells.bar = Bars.bar
SELECTSells.price = 2.75
Sells
Can speed up
execution if
we have an
index on
price
Bars
4
Example of Using an Index
5
Index should be on disk; here’s a disk-based hash index example
6
The whole table could be organized as an
index; here’s a table organized as a hash
7
Summary
• Indexes help searching for information far more
efficiently
– speed up query execution
• Hash indexes
– have to be disk based
– can be outside the table, pointing into records in table
– or the whole table can be organized as a hash index
• Great for equality search
• Not so great for range search
8
Range Search
Sells( bar,
Joe’s
Joe’s
Sue’s
Sue’s
beer,
Bud
Miller
Bud
Coors
price )
2.50
2.75
2.50
3.00
Bars( bar, addr
)
Joe’s Maple St.
Sue’s River Rd.
Query: Find all locations that sell beer between 2.75 and 3
Select addr
From Sells, Bars
Where (Sells.bar = Bars.bar) and (Sells.price >= 2.75)
and (Sells.price <= 3)
9
Hashes are not well suited for range search
10
A better idea: organize table as a sorted file
11
We need more than just a sorted file: B+ trees
12
The B+ tree index can be outside the table
13
A larger B+ tree example; here leaves
point to where the data records reside
Note how the leaves are
chained together, to help
range search
80
20
10
10
15
15
18
60
100
20
18
20
30
30
40
40
50
60
50
60
65
120
140
80
65
80
85
85
90
90
14
B+ Trees Basics
• Parameter d = the degree
• Each node has >= d and <= 2d keys (except root)
– if a node has n keys, then it has (n+1) pointers to
lower-level nodes
• Each leaf has >=d and <= 2d keys
– if a leaf has n keys, then it has n pointers to the data
records with those keys
– and also a pointer to the next leaf (to facilitate range
query answering)
15
B+ Tree Example
d=2
each non-root node must have
at least 2 keys
80
20
k < 20
10
60
100
20 <= k < 60
15
18
20
120
140
60 <= k
30
40
50
60
65
80
85
90
pointer to data records
10
15
data records
18
20
30
40
50
60
65
80
85
90
16
B+ Tree Design
• How large is d ?
• Example:
– Key size = 4 bytes
– Pointer size = 8 bytes
– Block size = 4096 byes
• 2d x 4 + (2d+1) x 8 <= 4096
• d = 170
17
Searching a B+ Tree
• Equality queries (i.e., given exact key values):
– Start at the root
– Proceed down, to the leaf
• Range queries:
– As above
– Then sequential traversal
Select name
From people
Where age = 18
Select name
From people
Where 20 <= age
and age <= 65
18
Searching a B+ Tree
80
20
10
10
15
15
18
60
100
20
18
20
30
30
40
40
50
60
50
60
65
120
140
80
65
80
85
85
90
90
19
20
B+ Trees in Practice
• Typical order: 100. Typical fill-factor: 67%.
– so average fanout = 200*0.67% = 133
• Typical capacities:
– Height 4: 1334 = 312,900,700 records
– Height 3: 1333 = 2,352,637 records
• Can often hold top levels in buffer pool:
– Level 1 =
1 page = 8 Kbytes
– Level 2 =
133 pages = 1 Mbyte
– Level 3 = 17,689 pages = 133 MBytes
21
22
Sample Exam Question
• Assume a file which has 950 thousands (that is, 950000)
records. Assume also that we are indexing this file using
a B+ tree. In this particular B+ tree, the average page has
an occupancy of 100 pointers (that is, the tree's average
branching factor is 100).
• Assume further that the amount of memory set aside for
storing the index is 150 blocks, and that all 950000
records of the above file reside on disk. Assume no
duplicate search key exists, given a search key K,
compute the minimal number of disk I/O needed to
retrieve the record with that search key. Explain your
answer.
23
Solution
• Assume each leaf node points to data records
– so leaf nodes do not contain data records
• We have 950,000 records, so need 950,000 pointers from
leaf nodes
• Each leaf node can point to 100 records, so need 9,500
leaf nodes
• So tree has three levels: root at level 0, 100 nodes at
level 1, 10000 nodes at level 2
• Have 150 memory pages for index, so can store root
node + all of level 1 and some of level 2
• So minimal cost is 1
24
Some Math
• Assume tree has fanout of F (that is, each nonleaf node has F pointers to lower-level nodes)
• Assume tree has N leaf pages
• Then the number of levels in tree is log_F(N)
• So each insert/delete/lookup costs roughly
log_F(N)
– assuming the entire tree is on disk
• This assumes that tree will stay height-balanced
• Insert/delete must ensure that tree stay heightbalanced
25
To explain insert/delete, we will use
the following B+ tree, with a slightly
different notation
note how pointers are
specified using a
different notation here
Root
13
2*
3*
5*
7*
14* 16*
17
24
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
2* means a pair of (key, pointer to data record)
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
26
Inserting a Data Entry into a B+ Tree
Find correct leaf L.
 Put data entry onto L.



If L has enough space, done!
Else, must split L (into L and a new node L2)
• Redistribute entries: d in L, (d+1) in L2, copy up middle key.
• Insert index entry pointing to L2 into parent of L.

Then if parent of L needs to be split up, do so;
this can happen all the way to the root


To split a non-leaf node, redistribute entries evenly,
but push up middle key. (Contrast with leaf splits.)
Splits “grow” tree; root split increases height.

Tree growth: gets wider or one level taller at top.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
27
Root
Insert 8*
3*
2*
5*
7*
13
17
24
19* 20* 22*
14* 16*
30
24* 27* 29*
33* 34* 38* 39*
first insert 8* in here
see that node is full; so have to split
Entry to be inserted in parent node.
(Note that 5 is
s copied up and
continues to appear in the leaf.)
5
2*
3*
5*
7*
8*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
28
insert 5 in here, but
now this node is full
13
5
2*
3*
5*
7* 8*
14* 16*
17
5
13
17
24
24
30
19* 20* 22*
24* 27* 29*
33* 34* 38* 39*
Entry to be inserted in parent node.
(Note that 17 is pushed up and only
appears once in the index. Contrast
this with a leaf split.)
30
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
29
Example B+ Tree After Inserting 8*
Root
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
 Notice that root was split, leading to increase in height.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
30
Note on Inserting 8* into Example
B+ Tree


Observe how
minimum
occupancy is
guaranteed in
both leaf and
index pg splits.
Note difference
between copyup and push-up;
be sure you
understand the
reasons for this.
Entry to be inserted in parent node.
(Note that 5 is
s copied up and
continues to appear in the leaf.)
5
2*
3*
5*
7*
17
5
13
24
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
8*
Entry to be inserted in parent node.
(Note that 17 is pushed up and only
appears once in the index. Contrast
this with a leaf split.)
30
31
Deleting a Data Entry from a B+ Tree
Start at root, find leaf L where entry belongs.
 Remove the entry.



If L is at least half-full, done!
If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling (adjacent
node with same parent as L).
• will have to change parent a bit (see example)
• If re-distribution fails, merge L and sibling.
If merge occurred, must delete entry (pointing to L
or sibling) from parent of L.
 Merge could propagate to root, decreasing height.

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
32
Deleting 19*
Root
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
19* 20* 22*
30
24* 27* 29*
33* 34* 38* 39*
Just remove node, doesn’t change tree
Root
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
20* 22*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
30
24* 27* 29*
33* 34* 38* 39*
33
Now Deleting 20*
17
5
2*
3*
24
13
5*
7* 8*
14* 16*
20* 22*
30
24* 27* 29*
33* 34* 38* 39*
Must borrow from a sibling if possible
17
5
2*
3*
27
13
5*
7* 8*
14* 16*
22* 24*
30
27* 29*
33* 34* 38* 39*
Notice how the middle key is copied up
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
34
Now Deleting 24*
17
5
2*
3*
27
13
5*
7* 8*
22* 24*
14* 16*
30
33* 34* 38* 39*
27* 29*
Can’t borrow from any sibling; must merge
5
2*
3*
5*
7*
8*
13
14* 16*
17
30
22* 27* 29*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
33* 34* 38* 39*
35
... Deleting 24*
Must merge.
 Observe `toss’ of
index entry (on right),
and `pull down’ of
index entry (below).

30
22*
27*
29*
33*
34*
38*
39*
Root
5
2*
3*
5*
7*
8*
13
14* 16*
17
30
22* 27* 29*
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
33* 34* 38* 39*
36
Comparison
• Hash
• B+ tree
37
Types of Indexes
• clustered vs. non-clustered
• primary vs. secondary
38
A non-clustered index
39
A clustered index
40
A clustered index
41
Indexes over composite search keys

Composite Search Keys: Search on a
combination of fields.
 Equality query: Every field
value is equal to a constant
value. E.g. wrt <sal,age> index:
• age=20 and sal =75
 Range query: Some field value
is not a constant. E.g.:
• age =20; or age=20 and sal
> 10
Examples of composite key
indexes using lexicographic order.
11,80
11
12,10
12
12,20
13,75
<age, sal>
10,12
20,12
75,13
name age sal
bob 12
10
cal 11
80
joe 12
20
sue 13
75
13
<age>
10
Data records
sorted by name
80,11
<sal, age>
Data entries in index
sorted by <sal,age>
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke
12
20
75
80
<sal>
Data entries
sorted by <sal>
42