Transcript Hashing
Hashing on the Disk
Keys are stored in “disk pages” (“buckets”); several records fit within one page.
Retrieval:
  find the address of the page
  bring the page into main memory
  searching within the page comes for free
E.G.M. Petrakis, Hashing

[Figure: a hash function maps the key space Σ to data pages 0, 1, 2, ..., m-1, each holding up to b records]
Page size b: maximum number of records in a page.
Space utilization u: measure of the use of space,
$u = \frac{\#\,\text{stored records}}{\#\,\text{pages} \times b}$

Collisions
Keys that hash to the same address are stored within the same page.
If the page is full, either
  i. the page splits: allocate a new page and split the page content between the old and the new page, or
  ii. the page overflows: a list of overflow pages is attached to it.
[Figure: a full page (xxxx) linked to an overflow page (xx)]

Access Time
Goal: find a key in one disk access.
Access time ~ number of accesses.
Large u: good space utilization but many overflows or splits => more disk accesses.
Non-uniform key distribution => many keys map to the same addresses => overflows or splits => more accesses.

Categories of Methods
Static: require file reorganization
  open addressing, separate chaining
Dynamic: dynamic file growth, adapt to file size
  dynamic hashing, extendible hashing, linear hashing, spiral storage…

Dynamic Hashing Schemes
File size adapts to data size without total reorganization.
Typically 1-3 disk accesses to access a key.
Access time and u are a typical tradeoff.
u between 50% and 100% (typically 69%).
Complicated implementation.

Schemes With Index
Two disk accesses: one to access the index, one to access the data.
With the index in main memory => one disk access.
Problem: the index may become too large.
[Figure: an index pointing to data pages]
Dynamic hashing (Larson 1978)
Extendible hashing (Fagin et al. 1979)
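The definition of space utilization and the effect of collisions on a static file can be illustrated with a minimal sketch. It assumes integer keys and the hash function key mod m; the function names `space_utilization` and `static_hash_load` are hypothetical and chosen only for this illustration.

```python
def space_utilization(num_records: int, num_pages: int, b: int) -> float:
    """u = stored records / (pages * page capacity b)."""
    if num_pages <= 0 or b <= 0:
        raise ValueError("num_pages and b must be positive")
    return num_records / (num_pages * b)


def static_hash_load(keys, m: int, b: int):
    """Hash keys into m pages of capacity b (assumed hash: key mod m).

    Returns the pages and the number of records that did not fit
    in their home page and would need an overflow page or a split.
    """
    pages = [[] for _ in range(m)]
    overflowed = 0
    for k in keys:
        page = pages[k % m]
        if len(page) < b:
            page.append(k)
        else:
            overflowed += 1
    return pages, overflowed


if __name__ == "__main__":
    # Five keys collide on page 0, which holds only b = 4 records.
    pages, overflowed = static_hash_load([0, 5, 10, 15, 20, 1, 2], m=5, b=4)
    print(pages, overflowed)
    print(space_utilization(17, 5, 4))
```

A non-uniform key set such as the one in the usage example fills one page while the others stay nearly empty, which is the situation the Access Time slide warns about.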

Schemes Without Index
Ideally, less space and fewer disk accesses (at least one).
[Figure: an address space mapped directly to the data space]
Linear Hashing (Litwin 1980)
Linear Hashing with Partial Expansions (Larson 1980)
Spiral Storage (Martin 1979)

Hash Functions
Support for a shrinking or growing file:
  the address space shrinks or grows, and the hash function adapts to these changes
  hash functions use the first (last) bits of key = $b_{n-1} b_{n-2} \ldots b_i b_{i-1} \ldots b_2 b_1 b_0$
  $h_i(key) = b_{i-1} \ldots b_2 b_1 b_0$ supports $2^i$ addresses
  $h_i$ uses one more bit than $h_{i-1}$ to address larger files:
  $h_{i+1}(key) = h_i(key)$ if $b_i = 0$, or $h_{i+1}(key) = h_i(key) + 2^i$ if $b_i = 1$

Dynamic Hashing (Larson 1978)
Two-level index:
  primary h1(key): accesses a hash table
  secondary h2(key): accesses a binary tree
[Figure: the 1st level h1(k) is a hash table with entries 1, 2, 3, 4; the 2nd level h2(k) is a binary tree index leading to data pages of size b]

Index
Fixed (static) primary index: h1(key) = key mod m.
Dynamic behavior on the secondary index:
  h2(key) uses i bits of the key
  the bit sequence of h2 = $b_{i-1} \ldots b_2 b_1 b_0$ denotes which path on the binary tree index to follow in order to access the data page
  scan h2 from right to left (bit 1: follow the right path, bit 0: follow the left path)

[Figure: example with h1(key) = key mod 6. The 1st level index has entries 0 to 5. Under entry 1 a binary tree of depth 2 leads to the pages for h1=1, h2=“0”; h1=1, h2=“01”; h1=1, h2=“11”. Entry 5 points directly to a page (h1=5, h2=any). For h2(key) = “01” the depth of the binary tree is 2.]

Insertions
Initially a fixed-size primary index and no data.
Insert the record in a new page under its h1 address.
If the page is full:
  allocate one extra page
  split the keys between the old and the new page
  use one extra bit in h2 for addressing
[Figure: the first page under entry 1 is addressed by h1=1, h2=any; after it splits, the two pages are addressed by h1=1, h2=0 and h1=1, h2=1]
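The two-level lookup and the page split of dynamic hashing can be sketched as follows. This is an illustrative sketch under stated assumptions, not Larson's original formulation: keys are distinct non-negative integers, h1(key) = key mod m, and h2 reads the bits of key // m from right to left (the choice of key // m as the bit source is an assumption made so that h2 is independent of h1). The class names `Page`, `Node`, and `DynamicHashFile` are hypothetical.

```python
class Page:
    """A data page holding at most b keys."""
    def __init__(self):
        self.keys = []


class Node:
    """Internal node of the binary tree index: left for bit 0, right for bit 1."""
    def __init__(self, left, right):
        self.left, self.right = left, right


class DynamicHashFile:
    def __init__(self, m: int, b: int):
        self.m, self.b = m, b
        self.roots = [None] * m  # fixed primary index, no data pages initially

    def insert(self, key: int) -> None:
        slot = key % self.m  # h1
        self.roots[slot] = self._insert(self.roots[slot], key, 0)

    def _insert(self, node, key, depth):
        if node is None:
            node = Page()
        if isinstance(node, Node):
            # Follow the h2 bit at this depth: 1 goes right, 0 goes left.
            if ((key // self.m) >> depth) & 1:
                node.right = self._insert(node.right, key, depth + 1)
            else:
                node.left = self._insert(node.left, key, depth + 1)
            return node
        if len(node.keys) < self.b:
            node.keys.append(key)
            return node
        # Full page: replace it by an internal node and redistribute the keys
        # on one extra h2 bit (assumes distinct keys, otherwise this recurses forever).
        split = Node(Page(), Page())
        for k in node.keys + [key]:
            self._insert(split, k, depth)
        return split

    def find_page(self, key: int):
        node, depth = self.roots[key % self.m], 0
        while isinstance(node, Node):
            node = node.right if ((key // self.m) >> depth) & 1 else node.left
            depth += 1
        return node  # a Page, or None if nothing was stored under this h1

    def contains(self, key: int) -> bool:
        page = self.find_page(key)
        return page is not None and key in page.keys


if __name__ == "__main__":
    f = DynamicHashFile(m=4, b=2)
    for k in (1, 5, 9):  # all three keys share h1 = 1
        f.insert(k)
    print(f.roots[1].left.keys, f.roots[1].right.keys)
```

Only the tree under one primary entry grows when its page splits; the other entries of the primary index are untouched, which is why the slides later describe dynamic hashing as affected only locally by non-uniform keys.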
[Figure: index and storage over successive insertions. A page under entry 0 (h1=0, h2=any) splits into pages addressed by h1=0, h2=0 and h1=0, h2=1; a page is added under entry 3 (h1=3, h2=any); after further splits the pages are addressed by h1=0, h2=0; h1=0, h2=01; h1=0, h2=11; h1=3, h2=0; h1=3, h2=1.]

Deletions
Find the record to be deleted using h1, h2.
Delete the record.
Check the “sibling” page: fewer than b records in both pages?
  if yes, merge the two pages
  delete the one empty page
  shrink the binary tree index by one level and reduce h2 by one bit
[Figure: a deletion followed by the merging of two sibling pages]

Extendible Hashing (Fagin et al. 1979)
Dynamic hashing without the primary index:
  primary hashing is omitted
  only secondary hashing, with all binary trees at the same level
The index shrinks and grows according to file size.
Data pages are attached to the index.

[Figure: a dynamic hashing index next to the same index with all binary trees at the same level; the latter is drawn as a table with entries 00, 01, 10, 11, annotated with the number of address bits]

Insertions
Initially 1 index entry and 1 data page:
  0 address bits
  insert records in the data page
Global depth d: the size of the index is $2^d$.
Local depth l: number of address bits of a page.
[Figure: index with d = 0 pointing to a single page of size b with l = 0]

Page “0” Overflows
[Figure: before the overflow, d = 0 and l = 0 with one page of size b; after it, d = 1 (global depth), the index has entries 0 and 1, and each of the two pages has l = 1 (local depth)]

Page “0” Overflows (cont.)
1 more key bit for addressing and 1 extra page => the index doubles!
Split the contents of the previous page between the 2 pages according to the next bit of the key.
Global depth d: number of index bits => index size $2^d$.
Local depth l: number of bits for record addressing.

Page “0” Overflows (again)
[Figure: d = 2, index entries 00, 01, 10, 11. The two pages with l = 2 contain records with the same 2 bits of the key; the page with l = 1 contains records with the same 1st bit of the key.]

Page “01” Overflows
[Figure: the index doubles to d = 3 with entries 000, 001, 010, 011, 100, 101, 110, 111]
[Figure labels: global depth d = 3; pages with local depths 2, 3, 3 and 1]
1 more key bit for addressing.
$2^{d-l}$: number of pointers to a page.

Page “100” Overflows
[Figure: d = 3, index entries 000 to 111; after the split the pages have local depths 2, 3, 3, 2, 2, with +1 new page]
No need to double the index:
  page 100 splits into two (1 new page)
  its local depth l is increased by 1

Insertion Algorithm
If l < d, split the overflowed page (1 extra page).
If l = d => the index is doubled and the page is split;
  d is increased by 1 => 1 more bit for addressing.
Update pointers (either way):
  a) if d prefix bits are used for addressing:
     d = d + 1; for (i = 2^d - 1; i >= 0; i--) index[i] = index[i/2];
  b) if d suffix bits are used:
     for (i = 0; i <= 2^d - 1; i++) index[i + 2^d] = index[i]; d = d + 1;

Deletion Algorithm
Find and delete the record.
Check the sibling page.
If there are fewer than b records in both pages:
  merge the pages and free the empty page
  decrease the local depth l by 1 (records in the merged page have 1 less common bit)
  if l < d everywhere => reduce the index (half size)
  update pointers

[Figure: delete with merging. With d = 3 and index entries 000 to 111, the merge leaves every page with l = 2 < d, so the index is halved to entries 00, 01, 10, 11 with d = 2.]

Observations
A page splits and there are more than b keys with the same next bit:
  take one more bit for addressing (increase l)
  if d = l the index doubles again!
Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with the same value):
  if the distribution is known, transform it to uniform
Dynamic hashing performs better for non-uniform distributions (it is affected only locally).

Performance
For n records and page size b, the expected size of the index (Flajolet) is
$\frac{e}{b \ln 2}\, n^{1 + 1/b} \approx \frac{3.92}{b}\, n^{1 + 1/b}$
1 disk access per retrieval when the index is in main memory.
2 disk accesses when the index is on disk.
Overflows increase the number of disk accesses.

Storage Utilization with Page Splitting
[Figure: one full page of size b before splitting, two half-full pages of size b after splitting]
After splitting: $u = \frac{b}{2b} = 50\%$
In general 50% < u < 100%.
On average u ~ ln 2 ~ 69% (no overflows).
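The extendible hashing insertion algorithm described earlier (split when l < d, double the index when l = d) can be sketched for the suffix-bit case. This is a minimal sketch under stated assumptions: keys are distinct non-negative integers, the last d bits of the key itself address the index, and there are no overflow pages. The names `Bucket` and `ExtendibleHash` are hypothetical.

```python
class Bucket:
    """A data page with its local depth l."""
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    def __init__(self, b: int):
        self.b = b                # page size: maximum records per page
        self.d = 0                # global depth: the index has 2**d entries
        self.index = [Bucket(0)]  # initially 1 index entry and 1 data page

    def _slot(self, key: int) -> int:
        return key & ((1 << self.d) - 1)  # last d bits of the key

    def contains(self, key: int) -> bool:
        return key in self.index[self._slot(key)].keys

    def insert(self, key: int) -> None:
        bucket = self.index[self._slot(key)]
        while len(bucket.keys) >= self.b:  # a split may leave the target page full
            self._split(bucket)
            bucket = self.index[self._slot(key)]
        bucket.keys.append(key)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.d:
            # l = d: double the index; entry i + 2**d is a copy of entry i.
            self.index = self.index + self.index
            self.d += 1
        bit = 1 << bucket.local_depth  # the next key bit used for addressing
        bucket.local_depth += 1
        new = Bucket(bucket.local_depth)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (new if k & bit else bucket).keys.append(k)
        # Redirect the index entries whose new bit is 1 to the new page.
        for i, entry in enumerate(self.index):
            if entry is bucket and i & bit:
                self.index[i] = new


if __name__ == "__main__":
    eh = ExtendibleHash(b=2)
    for k in range(5):
        eh.insert(k)
    print(eh.d, [(p.local_depth, p.keys) for p in eh.index])
```

In the usage example the page holding the odd keys keeps local depth 1 while d grows to 2, so two index entries point to it, matching the count of $2^{d-l}$ pointers per page.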

Storage Utilization with Overflows
Achieves higher u and avoids page doubling (d = l).
Higher u is achieved for small overflow pages:
  u = 2b/3b ~ 66% after splitting
  small overflow pages (e.g., b/2) => u = (b + b/2)/2b ~ 75%
Double the index only if the overflow overflows!
[Figure: two pages of size b with an attached overflow page]

Linear Hashing (Litwin 1980)
Dynamic scheme without index.
Indices refer to page addresses.
Overflows are allowed.
The file grows one page at a time.
The page which splits is not always the one which overflowed.
The pages split in a predetermined order.

Linear Hashing (cont.)
Initially n empty pages.
p points to the page that splits.
Overflows are allowed.
[Figure: a row of pages of size b with the pointer p at the first page]

File Growing
A page splits whenever the “splitting criterion” is satisfied:
  a page is added at the end of the file
  pointer p points to the next page
  the contents of the old page are split between the old and the new page based on key values

[Figure: example file before the split, p at page 0]
  page 0: 125 320 90 435, overflow: 215
  page 1: 16 711
  page 2: 402 27 737 712, overflow: 522
  page 3: 613 303
  page 4: 4 319
  new element: 438
  u = 17/22, compared with 80%: split
b = b_page = 4, b_overflow = 1
Initially n = 5 pages.
Hash function h0 = k mod 5.
Splitting criterion: u > A%.
Alternatively, split when an overflow page overflows, etc.

[Figure: the file after the split, p at page 1]
  page 0 (h1): 320 90
  page 1 (h0): 16 711
  page 2 (h0): 402 27 737 712, overflow: 522
  page 3 (h0): 613 303 438
  page 4 (h0): 4 319
  page 5 (h1): 125 435 215
  u = 18/25, compared with 80%
Page 5 is added at the end of the file.
The contents of page 0 are split between pages 0 and 5 based on the hash function h1 = key mod 10.
p points to the next page.

Hash Functions
Initially h0 = key mod n.
As new pages are added at the end of the file, h0 alone becomes insufficient.
The file will eventually double its size.
In that case use h1 = key mod 2n.
In the meantime:
  use h0 for pages not yet split
  use h1 for pages that have already split
Split the contents of the page pointed to by p based on h1.
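The rule of using h0 for pages not yet split and h1 for pages already split can be sketched as a small in-memory model. This is a sketch under simplifying assumptions: each page is a Python list that also holds its overflow records, and the utilization is computed as records / (pages × b), ignoring the space of overflow pages (the slides count that space as well). The class name `LinearHashFile` and its attribute names are hypothetical.

```python
class LinearHashFile:
    def __init__(self, n: int, b: int, threshold: float):
        self.n, self.b, self.threshold = n, b, threshold
        self.level = 0  # current hash function: key mod (n * 2**level)
        self.p = 0      # next page to split
        self.pages = [[] for _ in range(n)]  # each list: a page plus its overflow
        self.count = 0

    def address(self, key: int) -> int:
        a = key % (self.n << self.level)
        if a < self.p:  # that page has already split: use the next hash function
            a = key % (self.n << (self.level + 1))
        return a

    def insert(self, key: int) -> None:
        self.pages[self.address(key)].append(key)
        self.count += 1
        # Simplified splitting criterion: u > threshold.
        if self.count / (len(self.pages) * self.b) > self.threshold:
            self._split()

    def _split(self) -> None:
        size = self.n << self.level
        old = self.pages[self.p]
        # Rehash page p with the next hash function; the movers go to the
        # new page appended at the end of the file (address p + size).
        self.pages[self.p] = [k for k in old if k % (2 * size) == self.p]
        self.pages.append([k for k in old if k % (2 * size) != self.p])
        self.p += 1
        if self.p == size:  # the file has doubled: move to the next hash function
            self.level += 1
            self.p = 0


if __name__ == "__main__":
    f = LinearHashFile(n=5, b=4, threshold=0.8)
    for k in (125, 320, 90, 435, 16, 711, 402, 27, 737, 712,
              215, 522, 613, 303, 4, 319, 438):
        f.insert(k)
    print(f.p, f.pages)
```

With the keys of the example above, inserting 438 triggers the split of page 0 even though page 0 is not the page that received the key, and the keys 125, 435 and 215 move to the new page 5.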
When the file has doubled its size, h0 is no longer needed:
  set h0 = h1 and continue (e.g., h0 = k mod 10)
The file will eventually double its size again.
Deletions cause merging of pages whenever a merging criterion is satisfied (e.g., u < B%).

Hash Functions
Initially n pages and $0 \le h_0(k) < n$.
Series of hash functions:
$h_{i+1}(k) = h_i(k)$ or $h_{i+1}(k) = h_i(k) + n 2^i$
Selection of the hash function:
  if $h_i(k) \ge p$ then use $h_i(k)$
  else use $h_{i+1}(k)$

Linear Hashing with Partial Expansions (Larson 1980)
Problem with Linear Hashing: pages to the right of p are split late
  long chains of overflows build up on the rightmost pages
Solution: do not wait that long to split a page.
k partial expansions: take pages in groups of k
  all k pages of a group split together
  the file grows at a lower rate

Two Partial Expansions
Initially 2n pages, n groups, 2 pages per group.
Groups: (0, n), (1, 1+n), …, (i, i+n), …, (n-1, 2n-1)
[Figure: pages 0, 1, ..., n, ..., 2n with 2 pointers to the pages of the same group]
Pages in the same group split together => some records go to a new page at the end of the file (position: 2n).

1st Expansion
After n splits, all pages are split:
  the file has 3n pages (1.5 times larger)
  the file grows at a lower rate
[Figure: the file with marks at 0, n, 2n, 3n after the 1st expansion]
After the 1st expansion take pages in groups of 3 pages: (j, j+n, j+2n), $0 \le j < n$.

2nd Expansion
After n splits the file has size 4n.
Repeat the same process, having initially 4n pages in 2n groups.
[Figure: pages 0, 1, ..., 2n, ..., 4n with 2 pointers to the pages of the same group]

[Figure: disk accesses per retrieval (axis from 1 to 1.6) against relative file size (axis from 1 to 2), with curves for Linear Hashing and for Linear Hashing with 2 partial expansions]

             Linear Hashing   Linear Hashing, 2 part. exp.   Linear Hashing, 3 part. exp.
retrieval    1.17             1.12                           1.09
insertion    3.57             3.21                           3.31
deletion     4.04             3.53                           3.56
b = 5, b’ = 5, u = 0.85

Dynamic Hashing Schemes
Very good performance on membership, insert and delete operations.
Suitable for both main memory and disk:
  b = 1-3 records for main memory
  b = 1-4 Kbytes for disk
Critical parameter: space utilization u
  large u => more overflows, bad performance
  small u => fewer overflows, better performance
Suitable for direct access queries (random accesses) but not for range queries.
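The page groups used by Linear Hashing with Partial Expansions, (i, i+n) before the 1st expansion and (j, j+n, j+2n) after it, can be enumerated with a short sketch. The function name `expansion_groups` is hypothetical, and the sketch only lists the groups; it does not model how records move when a group splits.

```python
def expansion_groups(n: int, pages_per_group: int):
    """Return the n groups (j, j+n, j+2n, ...) for 0 <= j < n."""
    if n <= 0 or pages_per_group <= 0:
        raise ValueError("n and pages_per_group must be positive")
    return [tuple(j + g * n for g in range(pages_per_group)) for j in range(n)]


if __name__ == "__main__":
    n = 3
    print(expansion_groups(n, 2))  # groups of the initial file with 2n pages
    print(expansion_groups(n, 3))  # groups after the 1st expansion, 3n pages
```

Every page of the file belongs to exactly one group, so splitting the groups in order touches all pages once per partial expansion.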