Transcript Lecture 10
Indexing 1 Motivation Sells( bar, Joe’s Joe’s Sue’s Sue’s beer, Bud Miller Bud Coors price ) 2.50 2.75 2.50 3.00 Bars( bar, addr ) Joe’s Maple St. Sue’s River Rd. Query: Find all locations that sell beer for 2.75. Select addr From Sells, Bars Where (Sells.bar = Bars.bar) and (Sells.price = 2.75) 2 Sells( bar, Joe’s Joe’s Sue’s Sue’s beer, Bud Miller Bud Coors price ) 2.50 2.75 2.50 3.00 Bars(bar, Joe’s Sue’s addr ) Maple St. River Rd. PROJECTBars.addr SELECTSells.price = 2.75 JOIN Sells Sells.bar = Bars.bar Bars 3 Sells( bar, Joe’s Joe’s Sue’s Sue’s beer, Bud Miller Bud Coors price ) 2.50 2.75 2.50 3.00 Bars(bar, Joe’s Sue’s addr ) Maple St. River Rd. PROJECTBars.addr JOIN Sells.bar = Bars.bar SELECTSells.price = 2.75 Sells Can speed up execution if we have an index on price Bars 4 Example of Using an Index 5 Index should be on disk; here’s a disk-based hash index example 6 The whole table could be organized as an index; here’s a table organized as a hash 7 Summary • Indexes help searching for information far more efficiently – speed up query execution • Hash indexes – have to be disk based – can be outside the table, pointing into records in table – or the whole table can be organized as a hash index • Great for equality search • Not so great for range search 8 Range Search Sells( bar, Joe’s Joe’s Sue’s Sue’s beer, Bud Miller Bud Coors price ) 2.50 2.75 2.50 3.00 Bars( bar, addr ) Joe’s Maple St. Sue’s River Rd. Query: Find all locations that sell beer between 2.75 and 3 Select addr From Sells, Bars Where (Sells.bar = Bars.bar) and (Sells.price >= 2.75) and (Sells.price <= 3) 9 Hashes are not well suited for range search 10 A better idea: organize table as a sorted file 11 We need more than just a sorted file: B+ trees 12 The B+ tree index can be outside the table 13 A larger B+ tree example; here leaves point to where the data records reside Note how the leaves are chained together, to help range search 80 20 10 10 15 15 18 60 100 20 18 20 30 30 40 40 50 60 50 60 65 120 140 80 65 80 85 85 90 90 14 B+ Trees Basics • Parameter d = the degree • Each node has >= d and <= 2d keys (except root) – if a node has n keys, then it has (n+1) pointers to lower-level nodes • Each leaf has >=d and <= 2d keys – if a leaf has n keys, then it has n pointers to the data records with those keys – and also a pointer to the next leaf (to facilitate range query answering) 15 B+ Tree Example d=2 each non-root node must have at least 2 keys 80 20 k < 20 10 60 100 20 <= k < 60 15 18 20 120 140 60 <= k 30 40 50 60 65 80 85 90 pointer to data records 10 15 data records 18 20 30 40 50 60 65 80 85 90 16 B+ Tree Design • How large is d ? • Example: – Key size = 4 bytes – Pointer size = 8 bytes – Block size = 4096 byes • 2d x 4 + (2d+1) x 8 <= 4096 • d = 170 17 Searching a B+ Tree • Equality queries (i.e., given exact key values): – Start at the root – Proceed down, to the leaf • Range queries: – As above – Then sequential traversal Select name From people Where age = 18 Select name From people Where 20 <= age and age <= 65 18 Searching a B+ Tree 80 20 10 10 15 15 18 60 100 20 18 20 30 30 40 40 50 60 50 60 65 120 140 80 65 80 85 85 90 90 19 20 B+ Trees in Practice • Typical order: 100. Typical fill-factor: 67%. – so average fanout = 200*0.67% = 133 • Typical capacities: – Height 4: 1334 = 312,900,700 records – Height 3: 1333 = 2,352,637 records • Can often hold top levels in buffer pool: – Level 1 = 1 page = 8 Kbytes – Level 2 = 133 pages = 1 Mbyte – Level 3 = 17,689 pages = 133 MBytes 21 22 Sample Exam Question • Assume a file which has 950 thousands (that is, 950000) records. Assume also that we are indexing this file using a B+ tree. In this particular B+ tree, the average page has an occupancy of 100 pointers (that is, the tree's average branching factor is 100). • Assume further that the amount of memory set aside for storing the index is 150 blocks, and that all 950000 records of the above file reside on disk. Assume no duplicate search key exists, given a search key K, compute the minimal number of disk I/O needed to retrieve the record with that search key. Explain your answer. 23 Solution • Assume each leaf node points to data records – so leaf nodes do not contain data records • We have 950,000 records, so need 950,000 pointers from leaf nodes • Each leaf node can point to 100 records, so need 9,500 leaf nodes • So tree has three levels: root at level 0, 100 nodes at level 1, 10000 nodes at level 2 • Have 150 memory pages for index, so can store root node + all of level 1 and some of level 2 • So minimal cost is 1 24 Some Math • Assume tree has fanout of F (that is, each nonleaf node has F pointers to lower-level nodes) • Assume tree has N leaf pages • Then the number of levels in tree is log_F(N) • So each insert/delete/lookup costs roughly log_F(N) – assuming the entire tree is on disk • This assumes that tree will stay height-balanced • Insert/delete must ensure that tree stay heightbalanced 25 To explain insert/delete, we will use the following B+ tree, with a slightly different notation note how pointers are specified using a different notation here Root 13 2* 3* 5* 7* 14* 16* 17 24 19* 20* 22* 30 24* 27* 29* 33* 34* 38* 39* 2* means a pair of (key, pointer to data record) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 26 Inserting a Data Entry into a B+ Tree Find correct leaf L. Put data entry onto L. If L has enough space, done! Else, must split L (into L and a new node L2) • Redistribute entries: d in L, (d+1) in L2, copy up middle key. • Insert index entry pointing to L2 into parent of L. Then if parent of L needs to be split up, do so; this can happen all the way to the root To split a non-leaf node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height. Tree growth: gets wider or one level taller at top. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 27 Root Insert 8* 3* 2* 5* 7* 13 17 24 19* 20* 22* 14* 16* 30 24* 27* 29* 33* 34* 38* 39* first insert 8* in here see that node is full; so have to split Entry to be inserted in parent node. (Note that 5 is s copied up and continues to appear in the leaf.) 5 2* 3* 5* 7* 8* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 28 insert 5 in here, but now this node is full 13 5 2* 3* 5* 7* 8* 14* 16* 17 5 13 17 24 24 30 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* Entry to be inserted in parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.) 30 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 29 Example B+ Tree After Inserting 8* Root 17 5 2* 3* 24 13 5* 7* 8* 14* 16* 19* 20* 22* 30 24* 27* 29* 33* 34* 38* 39* Notice that root was split, leading to increase in height. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 30 Note on Inserting 8* into Example B+ Tree Observe how minimum occupancy is guaranteed in both leaf and index pg splits. Note difference between copyup and push-up; be sure you understand the reasons for this. Entry to be inserted in parent node. (Note that 5 is s copied up and continues to appear in the leaf.) 5 2* 3* 5* 7* 17 5 13 24 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 8* Entry to be inserted in parent node. (Note that 17 is pushed up and only appears once in the index. Contrast this with a leaf split.) 30 31 Deleting a Data Entry from a B+ Tree Start at root, find leaf L where entry belongs. Remove the entry. If L is at least half-full, done! If L has only d-1 entries, • Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). • will have to change parent a bit (see example) • If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 32 Deleting 19* Root 17 5 2* 3* 24 13 5* 7* 8* 14* 16* 19* 20* 22* 30 24* 27* 29* 33* 34* 38* 39* Just remove node, doesn’t change tree Root 17 5 2* 3* 24 13 5* 7* 8* 14* 16* 20* 22* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 30 24* 27* 29* 33* 34* 38* 39* 33 Now Deleting 20* 17 5 2* 3* 24 13 5* 7* 8* 14* 16* 20* 22* 30 24* 27* 29* 33* 34* 38* 39* Must borrow from a sibling if possible 17 5 2* 3* 27 13 5* 7* 8* 14* 16* 22* 24* 30 27* 29* 33* 34* 38* 39* Notice how the middle key is copied up Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 34 Now Deleting 24* 17 5 2* 3* 27 13 5* 7* 8* 22* 24* 14* 16* 30 33* 34* 38* 39* 27* 29* Can’t borrow from any sibling; must merge 5 2* 3* 5* 7* 8* 13 14* 16* 17 30 22* 27* 29* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 33* 34* 38* 39* 35 ... Deleting 24* Must merge. Observe `toss’ of index entry (on right), and `pull down’ of index entry (below). 30 22* 27* 29* 33* 34* 38* 39* Root 5 2* 3* 5* 7* 8* 13 14* 16* 17 30 22* 27* 29* Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 33* 34* 38* 39* 36 Comparison • Hash • B+ tree 37 Types of Indexes • clustered vs. non-clustered • primary vs. secondary 38 A non-clustered index 39 A clustered index 40 A clustered index 41 Indexes over composite search keys Composite Search Keys: Search on a combination of fields. Equality query: Every field value is equal to a constant value. E.g. wrt <sal,age> index: • age=20 and sal =75 Range query: Some field value is not a constant. E.g.: • age =20; or age=20 and sal > 10 Examples of composite key indexes using lexicographic order. 11,80 11 12,10 12 12,20 13,75 <age, sal> 10,12 20,12 75,13 name age sal bob 12 10 cal 11 80 joe 12 20 sue 13 75 13 <age> 10 Data records sorted by name 80,11 <sal, age> Data entries in index sorted by <sal,age> Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 12 20 75 80 <sal> Data entries sorted by <sal> 42