Why Not Store Everything in Main Memory? Why use disks?

Transcript Why Not Store Everything in Main Memory? Why use disks?

Indexes

• A Heap file allows record retrieval:

• by specifying the rid , or • by scanning all records sequentially

• Sometimes, retrieval of records by specifying the values in one or more fields is needed

(semantic search or value-based query), e.g., • Find all students in CS dept; Find students with gpa > 3 – Indexes are files (separate from the data file they that enable answering these value-based queries efficiently.

index

) – Indexes contain “ search keys ”, k, which are values from the attribute being indexed and “ value (usually pointers).

data entries ”, k*, which lead us to the records containing the search key

Index Classification

• Primary vs. secondary the clustered primary key, then it is called a primary index : If the search key contains , else it is called a secondary index .

• Clustered vs. unclustered : If the closeness of the data records is the same as the closeness of the data entries, the index is called a clustered index.

– A file can be clustered on at most 1 attribute ( search key ) – Cost of retrieving data records through an index varies

greatly

based on whether index is clustered or not!

Primary Index

PRIMARY INDEX: I(k,p)

k = ordered or clustered "key" field values from ordered or clustered field of file with

uniqueness property

(individual value occurrences are "unique" i.e., each value can occur at most once.) p = pointer to page containing record(s) with value, k

Primary indexes can be either: DENSE:

(every record is indexed) or

NON-DENSE:

only key-values of records at the beginning of a page are indexed (anchor record of page). (and then the pointer is page-# only)

Example:

Assume the blocking factor (bfr) is 2 which means 2 records/page.

STUDENT Non-dense Primary Index on S# |S#|SNAME |LCODE |pg |S#|pg |17|BAID |NY2091|1 |17| 1 |25|CLAY |NJ5101|1 |32| 2 |32|THAISZ|NJ5102|2 |57| 3 |38|GOOD |FL6321|2 |57|BROWN |NY2092|3 |83|THOM |ND3450|3 Dense Primary Index on S# |S#|pg offset |17| 1 0 |25| 1 1 |32| 2 0 |38| 2 1 |57| 3 0 |83| 3 1 RID Inserting and deleting are major problems. - must move records to maintain ordering - anchors change (in non-dense case)

Clustering Index

like a primary index except that the attribute nee not be a key - the file must be clustered on the attribute, k - the pointer for any k is the address of 1st page with that k-value

ENROLL2

|S#|C#| GRADE pg |17|6 | 96 |1 |25|6 | 76 |1 |32|6 | 62 |2 |38|6 | 98 |2 |32|6 | 91 |3 |25|7 | 68 |3 |32|8 | 89 |4 |17|9 | 95 |4 |C#|pg| Dense Clustering_Index on C# |6 | 1| |7 | 3| |8 | 4| |9 | 4| |C#|pg| Non-dense Clustering_Index on C# |6 | 1| (indexing new anchor records only) |8 | 4| There's no more search overhead with this 2 nd type of non-dense clustering index, but - How can you know which page has C#=7?

(search pages starting at pg=1) - How can you know which page has C#=9?

(search pags starting at pg=4)

Secondary Index

These indexes are the same as the previous except, - the file is need not be clustered on k - p points to the page or record containing k - every record must be indexed (dense) Option1: If there are multiple occurences of k, use multiple index entries for that k.

S#|C#| GRADE ENROLL (unclustered C#) 32|8 | 89 |1 25|6 | 76 |1 32|6 | 62 |2 25|8 | 86 |2 38|6 | 98 |3 32|7 | 91 |3 17|5 | 96 |4 25|7 | 68 |4 17|8 | 95 |5 Option2: Use repeating groups of pointers (requires variable length pointer(s) |C#|page |5 | 4 |6 | 1,2,3 |7 | 3,4 |8 | 1,2,4 C#|pg Secondary_Index, Option1 on C# 5 | 4 6 | 1 6 | 2 6 | 3 7 | 3 7 | 4 8 | 1 8 | 2 8 | 4 Option3: Use 1 index entry for each value, 1 pointer to "list" or "linked list" of record pointers. (1 level of indirection) |S#|C#| GRADE pg ENROLL (unclustered C#) |32|8 | 89 |1 |25|6 | 76 |1 |32|6 | 62 |2 |25|8 | 86 |2 |38|6 | 98 |3 |32|7 | 91 |3 |17|5 | 96 |4 |25|7 | 68 |4 |17|8 | 95 |5 |C#| page Secondary_Index, opt3 on C# |5 | -->|4| |6 | -->|1|->|2|->|3| |7 | -->|3|->|4| |8 | -->|1|->|2|->|4|

Multi-level Index

(made up of an index on an index) For any index, since it is a file clustered on the key, k, it can have a primary or clustering index on it. (constituting the second level of the multilevel index). STUDENT |S#|SNAME |LCODE |pg |17|BAID |NY2091|1 |25|CLAY |NJ5101|1 |32|THAISZ|NJ5102|2 |38|GOOD |FL6321|2 |57|BROWN |NY2092|3 |83|THOM |ND3450|3 |91|PARK |MN7334|4 |94|SIVA |OR1123|4 |S#|pg|pg (of index file) S#-index (nondense, primary) |17| 1|1 |32| 2|1 |57| 3|2 |91| 4|2 |S#|pg| |17| 1| |57| 2| 2nd_LEVEL (a second level, nondense index)

Index Classification (Contd.)

• If there is at least one index entry per existing attribute value, then it is called dense , else sparse

Anchor records of each page

Ashby Name, age, bonus Ashby, 25, 3000 Basu, 33, 4003 Bristow, 30, 2007 22 25 30 33 Cass, 50, 5004 Cass Smith Daniels, 22, 6003 40 Jones, 40, 6003 • Every sparse index must be clustered! Sparse indexes are smaller.

Sparse Index on Name Smith, 44, 3000 Tracy, 44, 5004 Data File 44 44 50 Dense Index on Age

• Tree-structured indexing techniques support

both range searches (AKA inequality searches) and equality searches .

• ISAM :

( variation of multilevel clustering ) static structure;

• B+ tree :

dynamic, adjust gracefully under insert and delete.

ISAM

1 index entry per page of data file, of the form: sorted on the attrribute value, k.

k* points to 1 st page (possibly) containing k.

Provides alternate entry points into the file – faster than binary search which has just one entry point.

K* 0 index entry K 1 K* 1 K 2 K* 2 K m K*m Index file may still be quite large. But we can apply the idea repeatedly!

Non-leaf (inode Leaf Pages Overflow page Primary pages

Leaf pages contain data entries, . In inodes, k*=indirect ptr.

**Example ISAM Tree • Where each node can hold 2 (k,k*) entries • in any internal node or inode (non-leaf) add ptr for key_values < the first k-value**

Root 40,40* 20,20* 33,33* 51,51* 63,63* 10,10* 15,15* 20,20* 27,27* 33,33* 37,37* 40,40* 46,46* 51,51* 55,55* 63,63* 97,97*

Insert k=23 Insert k=48 Insert k=41 Insert k=42 Index Pages 40,40* 20,20* 33,33* 51,51* 63,63* Primary Leaf Pages 10,10* 15,15* Overflow Pages 20,20* 27,27* Need overflow page 23,23* 33,33* 37,37* 40,40* 46,46* Need overflow page 48,48* 41,41* Need overflow page 42,42* 51,51* 55,55* 63,63* 97,97*

Deleting 42 Deleting 51 Deleting 97 40,40* 20,20* 33,33* 51,51* 63,63* 10,10* 15,15* 20,20* 27,27* 33,33* 37,37* 40,40* 46,46* 51,51* 55,55* 63,63* 97,97* 23,23* 48,48* 41,41* 42,42*

**Note that 51* appears in index levels, but not in leaf!**

B+ Tree: The Most Widely Used Index

• keeps tree height-balanced . • Minimum 50% occupancy (except for root). Each node contains m entries, where d



m



2d.

– d is called the

degree or order

of the index.

• Supports equality and range-searches efficiently.

Index Entries (“Direct search set or index set”) Data Entries ("Sequence set")

Example B+ Tree (d=2)

• Search begins at root, key comparisons direct it to a leaf.

• Search for 5 • Search for15 • Search for all data entries



24 Root 13 17 24 30 Leaves are doubly linked for fast sequential < search 2* 3* 7* 14* 16* 19* 20* 22* 15 is not in the file!

24* 27* 29* 33* 34* 38* 39*

Example B+ Tree (contd.)

• Search for all data entries < 23 • (note, this is the reason for the double linkage).

Root 13 17 24 30 2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*

Inserting a Data Entry into a B+ Tree

• Find correct leaf

• Put data entry in

.

– – If

L

has enough space,

done

Else, must

split L (into L and a new node L2)

• Redistribute entries, copy up (promote) middle key.

• middle value which was promoted and is now the anchor key for L2) .

• This can happen recursively

(e.g., if there is no space for the promoted middle value in the inode to which it is promoted) – To split inode , redistribute entries evenly, but push up (promote) middle key.

• So promote means Copy up at leaf; Move up at inode.

• Splits “grow” tree • only a root split increases height. – Only tree growth possible:

wider

1 level taller at top.

Inserting 8*

Entry to be inserted in parent node.

(Note that 17 is moved up and only appears once in the index. Contrast this with a leaf split.) 17 No room for 5, so split and move 17 up.

5 5 17 24 5* 30 7* No room for 8, so split.

2* 3* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* 5 to be inserted in parent node.

(Note that 5 is continues to appear in the new leaf node, L2, as anchor value.) Observe how minimum occupancy is guaranteed in both leaf and index pg splits.

• Note difference between copy-up (leaf) and move-up (inode)

Root 13 17 24 30 2* 3* 5* 7* 19* 20* 22* 14* 16*

B+ Tree Before Inserting 8*

24* 27* 29* 33* 34* 38* 39*

After Inserting 8*

Root 17 5 13 24 30 2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39* Note height_increase, balance and occupancy maintenance.

Deleting a Data Entry from a B+ Tree

• Start at root, find leaf

where entry belongs.

• Remove the entry.

– If L is at least half-full,

done!

– If L has only d-1 entries, • Try to re-distribute , borrowing from

sibling (adjacent node with same parent as L)

• If re-distribution fails,

merge L

and a sibling.

• Merge could propagate to root, and therefore decreasing height.

2* 3* Root 5 13 5* 7* 8* 14* 16* 17 Example Tree After Inserting 8* Then Deleting 19*, 20* Root 17 5 13 19* 20* 22* 24 30 24* 27* 29* 33* 34* 38* 39* 27 30 2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* • Deleting 19* is easy.

• Deleting 20* is done with re-distribution of 24* (and revision of anchor value (from 24 to 27) in inode.

... And Then Deleting 24* 5 13 17 27 30 2* 3*

• Must merge.

5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39* 27 17 5 13 30 2* 3* 5* 7* 8* 14* 16* 22* 27* 29* 33* 34* 38* 39*

• Observe `

toss

’ of index entry, 27, now that inode is below min occupancy so merge it with its sibling • and index entry, 17 can be `

pulled down

’ (sibling merge, followed by pull-down)

Root 5 13 17 30 2* 3* 5* 7* 8* 14* 16* 22* 27* 29* 33* 34* 38* 39*

Summary so far

• Tree indexes are ideal for range-searches and equality searches.

• ISAM is a static structure.

– – Only leaf pages modified; overflow pages needed.

Overflow chains can degrade performance unless size of data set and data distribution stay constant.

• B+ tree is a dynamic structure.

– Inserts/deletes leave tree height-balanced.

– High fanout (F) means depth rarely more than 3 or 4.

– – Almost always better than maintaining a sorted file.

Typically, 67% occupancy on average.

– Usually preferable to ISAM – adjusts to growth gracefully.

– Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS.

– Caution! There is much variation in implementation

Multidimensional Index

Multidimensional data almost always requires multidimensional indexing for effective access. One dimensional indexes assume a single search column, attribute (search key) which can be a composite column or key. Data structures, that support queries into multidimensional data specifically, fall in two categories: 1. Hash-table-like (e.g., Grid files and partitioned hash fctns) 2. Tree-like, eg, multi-key indexes, kd-trees, quad-trees (for sets of points); R-trees (for sets of regions as well as sets of points) ), Predicate-trees (P-trees) for vertical compressed, representations of data

Hash-like Structures for Multidimensional e.g., Data Grid Files

Partition the POINTS space into a grid. In each dimension "grid lines" partition space into stripes. Points that fall right on a grid line belong to the stripe above it (i.e., grid-lines are the lower boundaries).

Example:

12 customer(age,sal) data points (i.e., records or tuples)

(

age,sal):

(24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260)

Grid hash function age sal points range range

0-39 0-89K (24,60) 40-55 0-89K (46,60) (50,80) 40-55 90-223K (50,100) (50,120) 56-99 90-223K (70,100) (84,140) 0-39 224-400K (30,260) (26,400) 40-55 224-400K (44,360) (50,280) 56-99 224-400K (60,260) Inserting into Grid files: If there is room, insert, else (two methods) 1. add overflow block and chain it to the primary block, or 2. reorganize the structure by adding or moving grid lines (similar to dynamic hashing) If vertical grid lines at

age=40, age=65

40 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K A problem with Grid files is that the number * * * * * * * and horizontal at 56 * *

SAL=90K, SAL=224K

0 10 20 30 40 50 60 70 80 90 100



AGE

of buckets grows exponentially with dimension and the grid may become sparse.

Hash-like Structures for Multidimensional e.g., Partitioned hash Files

is a sequence of hash functions, h=(h 1 ,...h

n ) such that h i produces the i th segment of bits in the hash key, that is, h(a) is the concatenation of bit subsequences, h 1 (a)h 2 (a)..h

n (a). Example: The data file is CUSTOMER(AGE,SAL) consisting again of (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260) Use 2 hash functions and 3 bits, the and the last 2 bits 1 st bit is for age with hash function, mod 2 (tens_digit of age) are for salary with hash function, mod 4 (hundreds_digit of sal) The lookup table is: Partitioined hash function

key points 0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1 ( 2 4, 0 60) ( 4 6, 0 60) ( 2 6, 4 00) ( 8 4, 1 40) ( 6 0, 2 60) ( 4 4, 3 60) ( 5 0, 0 80) ( 5 0, 1 00) ( 5 0, 1 20) ( 7 0, 1 00) ( 3 0, 2 60) ( 5 0, 2 80)

Tree-like Structures for Multidimensional e.g., Multi-key Index

Assume several attributes representing "dimensions" of the data points (data cube tuples) - uses a multi-level index, e.g., suppose there are 2 attributes: Provides a second level of Indexes on 2 nd attribute to all tuples with same 1 st attribute value / | /|--> / |--> Index on .--> < |--> 1st attr / \ |..

\ /|/ \|--> / | /|--> / | / |--> --> < |----> < |--> | \ |..

\ |\ \|--> \ | \ \| \ \ \ /|--> / |--> \ \ `>< |--> \ |..

\ \|--> \ `-> . . .

indexes on 2nd attr Take the (age, salary) points again (24,60) (24,260) (24,400) (50,80) (50,100) (50,120) (50,280) (60,100) (60,260) (84,140) \ . - - - - - - - - - - -> (24,060) / .- - - - - - -> (24,060) / / .- - - -> (24,400) ___/_________/______/ .--> |_60_|_260_|_400_____| / .- - - - - - -- - - - - -> (50,080) age / ____/________________ 24----' .-> |_80_|_100|_120_|_280_|- - - --> (50,280) 50-------' \ `- - - - - - - --> (50,120) 60. _______ `- - - - - - - - - - -> (50,100) 84 `-->|100|260|- - - - - - - - - - - - - --> (60,260) `- - - - - - - - - - - - - - - - --> (60,100) \ _____ `- >|_140_|- - - - - - - - - - - - - - --> (84,140)

Tree-like Structures for Multidimensional e.g., k dimensional ( kd tree ) Index

Interior nodes have (Attribute, Value, LowPointer, HighPontr) - Value is a value which splits data points - The example below will show (a, V, down, up) with pointers going down for LowPointer and up for HighPointer. (goes up on greater or equal actually). - Attributes used for different levels are different and ROTATE among the dimensions (round robin). - The leaves are blocks of records (assume data blocks hold 2 records, i.e., the blocking factor, bfr, is 2). - to search: decide along the tree until you reach a leaf (going up on greater or equal) - to insert: decide along the tree until you reach the proper leaf if there is room there, insert; else split the block and divide its contents according to the appropriate attribute (next one in the rotation). Example: (insert into kd-tree in this order using age first then salart, sal): age,sal (50,80) (84,140) (30,260) (44,360) (50,120) (70,100) (24,60) (26,400) (50,280) (46,60) (60,260) (50,100) insert the first 2 pairs (no tree yet, since just 1 leaf block): 50, 80 84, 140 age sal 30, 260 ( leaf is full so split it and divide the contents by sal=150) 30,260 sal / { ,150} < \ 50,80 84,140

Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued

30,260

sal / { ,150} < \

50,80 84,140

age sal 44,360 ( leaf is not full so insert) sal

30,260 44,360

/ { ,150} < \

50,80 84,140

age sal 50,120 ( leaf is full 30,260 44,360 sal / so split, divide contents by age=55) { ,150} < \ 84,140 \age / { 55, } < \ 50,80 50,120

age sal 50,120 sal { ,150} < \ 30,260 44,360 /

Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued

84,140 \age / { 55, } < \ 50,80 50,120 age sal 70,100 (leaf is not full 30,260 44,360 sal / so insert) { ,150} < \ 84,140 70,100 \age / { 55, } < \ 50,80 50,120 age sal 24,060 ( leaf is full 30,260 44,360 sal / { ,150} < \ so split, divide by sal=75) 84,140 70,100 \age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060

sal 30,260 44,360 / { ,150} < \ 84,140 70,100

Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued

\age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060 age sal 26,400 ( leaf full split, div by age=28) / / age { 28, }< / 30,260 44,360 \ 26,400 sal { ,150} < / / 84,140 \ 70,100 \age / { 55, } < \ 50,080 \ 50,120 \ sal / { ,75)< \ 24,060 age sal 50,280 ( leaf full split, div by sal=300) / / age / { 28, }< \ sal (300, }< / 44,360 \ 30,260 50,280 26,400 sal / / { ,150} < \ 84,140 70,100 \age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060

Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued

/ / age / { 28, }< \ sal / (300, }< 44,360 \ 30,260 50,280 26,400 sal / / { ,150} < \ 84,140 70,100 \age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060 age sal 46,060 (leaf not full so insert) sal { ,150} < / \ / / age / / { 28, }< \ sal / (300, }< 44,360 26,400 \ 30,260 50,280 84,140 70,100 \age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060 46,060

Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued

/ / age / { 28, }< \ sal / (300, }< 44,360 \ 30,260 50,280 26,400 sal / / { ,150} < \ 84,140 70,100 \age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060 46,060 age sal 60,260 (leaf full split by age=40) sal { ,150} < / \ / / age / / { 28, }< \ sal / (300, }< 44,360 26,400 \ \ age { 40, }< / \ 30,260 50,280 60,260 84,140 70,100 \age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060 46,060

sal / (300, }< 44,360 / / age / { 28, }< \ 26,400 \ \ age / { 40, }< \ 30,260

Tree-like Structures for Multidim e.g., k dim (kd tree) Index continued

sal / / { ,150} < \ 84,140 70,100 \age / { 55, } < \ \ 50,080 50,120 \ sal / { ,75)< \ 24,060 46,060 50,280 60,260 age sal 50,100 (full split age=50 full again split sal=90) / / age / / { 28, }< \ sal / (300, }< 44,360 26,400 \ \ age / { 40, }< \ 30,260 50,280 60,260 sal / { ,150} < \ 84,140 70,100 \age / { 55, } < sal { , 90 }< / 50,120 50,100 \ \ \ age { 50, }< / sal / { ,75)< \ 24,060 46,060 \ \ 50,080

Tree-like Structures for Multidimensional datasets e.g., Quad tree indexes

- Interior nodes (Inodes) correspond to rectangulars in 2-D (more generally, they can be constructed to represent hypercubes higher dimensional space) - If the number of points in the rectangle fits in a block, it's a leaf, else the rectangle is treated as interior node with children corresponding to its 4 quadrants. - to insert into the quad treee index: search to find the proper leaf; if there's room, insert; else split node into 4 quadrants, divide contents appropriately. Example: Build the Quad-tree index as it would develop, assuming (age,sal) arrive in this order: age,sal (24,60) (46,60) (50,80) (50,100) (50,120) (70,100) (84,140) (30,260) (26,400) (44,360) (50,280) (60,260) Insert (24,60) (46,60) The only leaf node is: age sal 24,060 46,060 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K * *

0 10 20 30 40 50 60 70 80 90 100



AGE

Tree-like Structures for Multidim datasets e.g., Quad tree indexes

24,060 46,060 insert age sal

50, 080

( leaf full split (e.g., at

age=50

.-NW / /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 \ `SE 50,080 and

sal=200

) divide contents by quadrant 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K * * *

0 10 20 30 40 50 60 70 80 90 100



AGE

.-NW / /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 \ `SE 50,080 400K 380K 360K 340K 320K 300K 280K 260K 240K 220K 200K 180K 160K 140K 120K 100K 80K 60K 40K 20K 0K

Tree-like Structures for Multidim datasets e.g., Quad tree indexes

insert

50, 100

( not full / .-NW /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 / /---NE age,sal / {50,200} < \ \---SW 24,060 \ 46,060 \ insert `SE 50,080 50,100 insert

50, 120

full split SE at

100

.-NW .-NW 50,100 / 50,120 /---NE \ `

SE( 75 , 100 )<

/ \ \---SW 50,080 \ \ `SE * * * *

ETC.

0 10 20 30 40 50 60 70 80 90 100



AGE

Tree-like Structures for Multidim datasets: Region tree (Rtree) indexes

- inodes of an R-tree correspond to interior regions, (which can be overlapping) (usually regions are rectangles, tho, not necessarily) - R-tree regions have subregions that represent the contents of their children - And the subregions need not cover the region they subdivide (but all data must be within a subregion) Example, Consider the spatial image: Assume a leaf can hold 6 regions (bfr=6) Example: Consider the spatial image: and that the 6 regions or objects above are together on 1 leaf block, whose region 100________________________________________________________ | | | | | | | | .---------. | | | school | |_________| | | |---------------------------. | | | road1 | |---------------------------| | |r | .-------. |house2 | |_______| .------.________ |o_|____________________________ |house1|________ |a_|________pipeline____________ _ _ | | | | | | | | | | | |______| |d | |2 | | | | | | | | | | | | | | is shown as the outer red rectangle Thus the R-tree has a root and 1 leaf: ( (0,0), (100,90) ) (corners of outer | | | 0 `-------------------------------------------------------' 0 | 100 red region) road1 road2 house1 school house2 pipeline (a full leaf with 6 objects)

Rtree indexes cont.

(0,0), (100,90) road1 | road2 | house1 (20,20), (100,80) POP 100________________________________________________________ | | | | | | | | | | .---------. | | | school | |_________| POP | | | | | | | | | | Split the full leaf putting 4 objects in 1 new leaf and 3 in the other (minimize overlap and split ~evenly) |---------------------------. | road1 | |---------------------------| | |r | | | | |______| |d | .-------. |house2 | |_______| .------.________ |o_|____________________________ |house1|________ |a_|________pipeline____________ _ _ | | | | | | | Now suppose a local cellular phone | | |2 | | | | | | | | 0 `-------------------------------------------------------' 0 | 100 company adds a POP as shown.

Rtree indexes cont.

(0,0), (100,90) (20,20), (100,80) road1 | road2 | house1 house2 house3 school | house2 | pipeline POP Note that house2 is in both regions.

100________________________________________________________ | | | | | | | | | | .---------. | | | school | |_________| POP | Since house3 is not in either region (and | | both have room) we must decide to | expand one of them.

| | If we pick the green, expanding it to | | (0,20), (100,80) we add 1600 units 2 | | If we pick the purple, expanding it to ((0,0), (80,50) we add 1000 units 2 |---------------------------. | road1 | |---------------------------| | | | | |______| |r | |d | .-------. |house2 | |_______| | | | | .------.________ |o_|____________________________ _ | |house1|________ |a_|________pipeline____________ _ | | so to minimize we pick the purple.

Now suppose we insert house3 | | | |2 | | | | | house3 0 `-------------------------------------------------------' 0 | | | 100

Binary Radix Tree Index (AKA a trie)

an additional index structure (e.g., used in IBM AS/400 systems)

Similar to B-tree, except - only the common parts of key values are embedded in inodes - a single bit is used to make the navigation direction decision at each level (0 for up and 1 for down). (zero-based bit positions are used) Example: (in this example, the tree structure is being built left-to-right) Starting with an empty structure, INSERT JAY | LA | 25 | STAR (assigned RRN=1 to it)

nam_trie_INDEX

name part RRN b3 J< JAY 1 ON 2

CUSTOMER FILE

RRN nam loc age job 1 | JAY | LA | 25 | STAR 2 | JON | LA | 45 | HOOD INSERT JON | LA | 45 | HOOD (assigned RRN=2 1st letters are teh same (J) so the common pat is embedded in the root 2nd letters: A and O, bit 3 (zero-based count) is 1 st difference (and makes the decision) 0123 4567  bit positions DBCDIC for A=1100 0001 EBCDIC for O=1101 0110

Binary Radix Tree Index (AKA a trie) cont.

nam_trie_INDEX

b3 J < b2 A< N 3 Y 1 ON 2 ON 2

CUSTOMER FILE

RRN nam loc age job 1 | JAY | LA | 25 | STAR 2 | JON | LA | 45 | HOOD 3 | JAN | RO | 93 | DOC INSERT JAN | RO | 93 | DOC (assigned RRN=3 0123 4567  bit positions DBCDIC for N=1101 0101 EBCDIC for Y=1110 0000

Binary Radix Tree Index (AKA a trie) cont.

nam_trie_INDEX

b2 < b3 J < b2 A< ON 2 SUE 2 N 3 Y 1

CUSTOMER FILE

RRN nam loc age job 1 | JAY | LA | 25 | STAR 2 | JON | LA | 45 | HOOD 3 | JAN | RO | 93 | DOC 4 | SUE | RO | 16 | PROG INSERT SUE | RO | 16 | PROG (assigned RRN=4 0123 4567  bit positions DBCDIC for J= 1101 0001 EBCDIC for Y= 1110 0010

Why Not Store Everything in Main Memory? Why use disks?

Transcript Why Not Store Everything in Main Memory? Why use disks?

Indexes

• A Heap file allows record retrieval:

• Sometimes, retrieval of records by specifying the values in one or more fields is needed

index

Index Classification

• Primary vs. secondary the clustered primary key, then it is called a primary index : If the search key contains , else it is called a secondary index .

• Clustered vs. unclustered : If the closeness of the data records is the same as the closeness of the data entries, the index is called a clustered index.

greatly

Primary Index

Clustering Index

Secondary Index

Multi-level Index

Index Classification (Contd.)

• Tree-structured indexing techniques support

• ISAM :

• B+ tree :

ISAM

Example ISAM Tree • Where each node can hold 2 (k,k*) entries • in any internal node or inode (non-leaf) add ptr for key_values < the first k-value

Note that 51* appears in index levels, but not in leaf!

B+ Tree: The Most Widely Used Index

• keeps tree height-balanced . • Minimum 50% occupancy (except for root). Each node contains m entries, where d

m

2d.

degree or order

• Supports equality and range-searches efficiently.

Example B+ Tree (d=2)

Example B+ Tree (contd.)

Inserting a Data Entry into a B+ Tree

• Find correct leaf

• Put data entry in

.

L

done

split L (into L and a new node L2)

• This can happen recursively

wider

1 level taller at top.

B+ Tree Before Inserting 8*

After Inserting 8*

Deleting a Data Entry from a B+ Tree

• Start at root, find leaf

where entry belongs.

• Remove the entry.

done!

sibling (adjacent node with same parent as L)

merge L

• Merge could propagate to root, and therefore decreasing height.

Summary so far

Multidimensional Index

Hash-like Structures for Multidimensional e.g., Data Grid Files

Hash-like Structures for Multidimensional e.g., Partitioned hash Files

Tree-like Structures for Multidimensional e.g., Multi-key Index

Tree-like Structures for Multidimensional e.g., k dimensional ( kd tree ) Index

Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued

Tree-like Structures for Multidimensional e.g., k dimensional (kd tree) Index continued

Binary Radix Tree Index (AKA a trie)

Binary Radix Tree Index (AKA a trie) cont.

Binary Radix Tree Index (AKA a trie) cont.

Directory

**Example ISAM Tree • Where each node can hold 2 (k,k*) entries • in any internal node or inode (non-leaf) add ptr for key_values < the first k-value**

**Note that 51* appears in index levels, but not in leaf!**