Transcript Hashing

Hashing on the Disk
 Keys are stored in “disk pages”
(“buckets”)
 several records fit within one page
 Retrieval:
 find address of page
 bring page into main memory
 searching within the page comes for
free
E.G.M. Petrakis
Hashing
[Figure: a hash function maps the key space Σ to data pages 0, 1, 2, ..., m-1; each page holds up to b records]
 page size b: maximum number of records in page
 space utilization u: measure of the use of space
$u = \frac{\#\ \text{stored records}}{\#\ \text{pages} \times b}$
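As a small illustration of the definition of u, the following Python sketch computes the space utilization of a file. The function name `space_utilization` and the sample numbers are assumptions made for this example only.

```python
def space_utilization(stored_records: int, pages: int, b: int) -> float:
    """Return u = stored_records / (pages * b)."""
    if pages <= 0 or b <= 0:
        raise ValueError("pages and b must be positive")
    return stored_records / (pages * b)


# Hypothetical file: 5 pages of capacity b = 4 holding 14 records.
u = space_utilization(stored_records=14, pages=5, b=4)
print(f"u = {u:.0%}")
```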
Collisions
 Keys that hash to the same address are stored within the same page
 If the page is full:
i. page splits: allocate a new page and split page content between the old and the new page, or
ii. overflows: list of overflow pages
[Figure: a full page (xxxx) linked to an overflow page (xx)]
Access Time
 Goal: find key in one disk access
 Access time ~ number of accesses
 Large u: good space utilization but
many overflows or splits => more disk
accesses
 Non-uniform key distribution => many
keys map to the same addresses =>
overflows or splits => more accesses
Categories of Methods
 Static: require file reorganization
 open addressing, separate chaining
 Dynamic: dynamic file growth, adapt
to file size
 dynamic hashing,
 extendible hashing,
 linear hashing,
 spiral storage…
Dynamic Hashing Schemes
 File size adapts to data size without
total reorganization
 Typically 1-3 disk accesses to access
a key
 Access time and u are a typical tradeoff
 u between 50-100% (typically 69%)
 Complicated implementation
Schemes With Index
 Two disk accesses:
 one to access the index, one to access the data
 with index in main memory => one disk access
 Problem: the index may become too large
[Figure: an index whose entries point to data pages]
 Dynamic hashing (Larson 1978)
 Extendible hashing (Fagin et al. 1979)
Schemes Without Index
 Ideally, less space and fewer disk accesses (at least one)
[Figure: the address space maps directly to the data space]
 Linear Hashing (Litwin 1980)
 Linear Hashing with Partial Expansions (Larson 1980)
 Spiral Storage (Martin 1979)
Hash Functions
 Support for shrinking or growing file
 shrinking or growing address space, the
hash function adapts to these changes
 hash functions using first (last) bits of key = $b_{n-1} b_{n-2} \dots b_i b_{i-1} \dots b_2 b_1 b_0$
 $h_i(\text{key}) = b_{i-1} \dots b_2 b_1 b_0$ supports $2^i$ addresses
 $h_i$: one more bit than $h_{i-1}$ to address larger files
$$h_i(\text{key}) = \begin{cases} h_{i-1}(\text{key}) \\ h_{i-1}(\text{key}) + 2^{i-1} \end{cases}$$
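The family of bit-based hash functions can be sketched in Python as follows. This is a minimal illustration that assumes integer keys and takes the last (least significant) i bits; the function name `h` and the sample key are chosen for the example only.

```python
def h(i: int, key: int) -> int:
    """Return the i least significant bits of key, i.e. b_{i-1} ... b_1 b_0."""
    return key & ((1 << i) - 1)


key = 0b101101  # hypothetical 6-bit key
for i in range(1, 7):
    prev, cur = h(i - 1, key), h(i, key)
    # Going from h_{i-1} to h_i either keeps the address or adds 2^(i-1).
    assert cur in (prev, prev + (1 << (i - 1)))
    print(i, format(cur, f"0{i}b"))
```

Using one more bit doubles the number of addresses, and every old address a maps only to a or a + 2^(i-1), which is what lets a single page be split without touching the others.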
Dynamic Hashing (Larson 1978)
 Two level index
 primary h1(key): accesses a hash table
 secondary h2(key): accesses a binary tree
[Figure: h1(k) selects one of the entries 1-4 of the 1st-level index (hash table); each entry leads to a 2nd-level binary tree index, traversed with h2(k), whose leaves point to data pages of size b]
Index
 Fixed (static): h1(key) = key mod m
 Dynamic behavior on secondary index
 h2(key) uses i bits of key
 the bit sequence of $h_2 = b_{i-1} \dots b_2 b_1 b_0$ denotes which path on the binary tree index to follow in order to access the data page
 scan h2 from right to left (bit 1: follow right path, bit 0: follow left path)
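The two-level lookup can be sketched as below. This is only an illustration under stated assumptions: keys are integers, h1(key) = key mod m, and the h2 bits are taken from key // m (the slides do not say which key bits h2 uses). The names `Node` and `find_page` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Binary tree node: a leaf holds a data page, an internal node has two children."""
    page: Optional[List[int]] = field(default_factory=list)
    left: Optional["Node"] = None   # followed on bit 0
    right: Optional["Node"] = None  # followed on bit 1

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def find_page(index: List[Node], key: int) -> List[int]:
    """Locate the data page for key: h1 picks the tree, the h2 bits pick the path."""
    m = len(index)
    node = index[key % m]   # h1(key) = key mod m
    bits = key // m         # assumed source of the h2 bits
    while not node.is_leaf():
        # Scan h2 from right to left: bit 1 -> right child, bit 0 -> left child.
        node = node.right if bits & 1 else node.left
        bits >>= 1
    return node.page


# Index with m = 6 entries; entry 1 has pages reached by h2 = "0", "01" and "11".
m = 6
index = [Node() for _ in range(m)]
page_0, page_01, page_11 = [], [], []
index[1] = Node(
    page=None,
    left=Node(page=page_0),
    right=Node(page=None, left=Node(page=page_01), right=Node(page=page_11)),
)
print(find_page(index, 7) is page_01)
```

For key 7, h1 = 1 and the assumed h2 bits are "1" followed by zeros, so the search goes right and then left, which is the page labelled h2 = "01" when read right to left.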
[Figure: 1st-level index with entries 0-5, addressed by h1(key) = key mod 6; 2nd-level binary trees, traversed with h2(k), lead to data pages of size b. The pages are labelled (h1=1, h2=“0”), (h1=1, h2=“01”), (h1=1, h2=“11”) and (h1=5, h2=any). The length of h2(key), e.g. “01”, is at most the depth of the binary tree, here 2]
Insertions
 Initially fixed size primary index and no data
[Figure: primary index with entries 0-3 and a first data page labelled h1=1, h2=any]
 insert record in new page under h1 address
 if page is full, allocate one extra page
 split keys between old and new page
 use one extra bit in h2 for addressing
[Figure: after the split, entry 1 of the index leads to two pages labelled (h1=1, h2=0) and (h1=1, h2=1)]
[Figure: a sequence of insertions into an index with entries 0-3 and pages of size b. Pages first labelled (h1=0, h2=any) and (h1=3, h2=any) are split into (h1=0, h2=0) and (h1=0, h2=1); later stages show pages labelled (h1=0, h2=01), (h1=0, h2=11), (h1=3, h2=0) and (h1=3, h2=1)]
Deletions
 Find record to be deleted using h1, h2
 Delete record
 Check “sibling” page:
 fewer than b records in both pages combined?
 if yes merge the two pages
 delete one empty page
 shrink binary tree index by one level and reduce h2 by one bit
[Figure: a deletion followed by merging of two sibling pages; the binary tree under the primary index (entries 0-3) shrinks by one level]
Extendible Hashing (Fagin et al. 1979)
 Dynamic hashing without the primary (first-level) index
 Primary hashing is omitted
 Only secondary hashing with all binary
trees at the same level
 The index shrinks and grows
according to file size
 Data pages attached to the index
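[Figure: two indexes side by side. Left: dynamic hashing, with a primary index (entries 0-4) and binary trees of different depths. Right: dynamic hashing with all binary trees at the same level, giving index entries 00, 01, 10, 11; each data page is annotated with its number of address bits]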
Insertions
 Initially 1 index and 1 data page
 0 address bits
 insert records in data page
 global depth d: size of index $2^d$
 local depth l: number of address bits
[Figure: an index with d = 0 pointing to a single data page in storage with l = 0 and capacity b]
Page “0” Overflows
[Figure: before the split the index has d = 0 and a single page with l = 0; after the split, d: global depth = 1, and index entries 0 and 1 point to two pages of size b, each with l: local depth = 1]
Page “0” Overflows (cont.)
 1 more key bit for addressing and 1 extra page => index doubles !!
 Split contents of previous page between 2 pages according to next bit of key
 Global depth d: number of index bits => $2^d$ index size
 Local depth l: number of bits for record addressing
Page “0” Overflows (again)
[Figure: index with d = 2 and entries 00, 01, 10, 11. Two pages with l = 2 contain records with the same 2 bits of key; one page with l = 1 (smaller than d) contains records with the same 1st bit of key]
Page “01” Overflows
[Figure: index with d = 3 and entries 000-111; the pages have local depths 2, 3, 3 and 1. 1 more key bit is used for addressing; $2^{d-l}$ is the number of index pointers to a page]
Page “100” Overflows
[Figure: index with entries 000-111; page “100” splits into two pages and its local depth is increased by 1 (+1), while the index keeps its size]
 no need to double index
 page 100 splits into two (1 new page)
 local depth l is increased by 1
Insertion Algorithm
 If l < d, split overflowed page (1 extra page)
 If l = d => index is doubled, page is split
 d is increased by 1 => 1 more bit for addressing
 update pointers (either way):
a) if d prefix bits are used for addressing
d = d + 1;
for (i = (1 << d) - 1; i >= 0; i--) index[i] = index[i / 2];
b) if d suffix bits are used
for (i = 0; i <= (1 << d) - 1; i++) index[i + (1 << d)] = index[i];
d = d + 1;
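The insertion steps can be put together in a short Python sketch. It is an illustration under several assumptions, not the slides' own implementation: keys are distinct integers used directly as hash values, the d suffix (low-order) bits address the index, and the class names `Page` and `ExtendibleHash` are hypothetical.

```python
class Page:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth  # l: number of key bits shared by the page
        self.keys = []


class ExtendibleHash:
    def __init__(self, b: int):
        self.b = b               # page capacity
        self.d = 0               # global depth
        self.index = [Page(0)]   # 2^d pointers to pages

    def _slot(self, key: int) -> int:
        return key & ((1 << self.d) - 1)   # d suffix bits of the key

    def contains(self, key: int) -> bool:
        return key in self.index[self._slot(key)].keys

    def insert(self, key: int) -> None:
        page = self.index[self._slot(key)]
        if len(page.keys) < self.b:
            page.keys.append(key)
            return
        if page.local_depth == self.d:
            # l = d: double the index; entry i + 2^d is a copy of entry i.
            self.index = self.index + self.index
            self.d += 1
        # Now l < d: split the overflowed page on one more key bit.
        bit = 1 << page.local_depth
        page.local_depth += 1
        new_page = Page(page.local_depth)
        old_keys, page.keys = page.keys, []
        for k in old_keys:
            (new_page if k & bit else page).keys.append(k)
        for i, p in enumerate(self.index):
            if p is page and i & bit:
                self.index[i] = new_page   # update pointers
        self.insert(key)   # retry; may split again (assumes distinct keys)


table = ExtendibleHash(b=2)
for k in range(6):
    table.insert(k)
print(table.d, [p.keys for p in table.index])
```

With more than b equal keys the retry would never terminate, which is the failure case for non-uniform key distributions; a real implementation would fall back to overflow pages.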
Deletion Algorithm
 Find and delete record
 Check sibling page
 If fewer than b records in both pages combined
 merge pages and free empty page
 decrease local depth l by 1 (records in merged page have 1 less common bit)
 if l < d everywhere => reduce index (half size)
 update pointers
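The index reduction step can be sketched as follows. This assumes suffix-bit addressing, so that the upper half of the index repeats the lower half whenever every page has l < d; the function name `halve_index` and the page labels are hypothetical.

```python
def halve_index(index, local_depth, d):
    """Halve a suffix-bit index while every page has local depth l < d.

    index is a list of page labels, local_depth maps a label to its l.
    """
    while d > 0 and all(local_depth[p] < d for p in index):
        index = index[: len(index) // 2]   # upper half duplicates the lower half
        d -= 1
    return index, d


# After merging, four pages with l = 2 remain under an index with d = 3.
index = ["A", "B", "C", "D", "A", "B", "C", "D"]
local_depth = {"A": 2, "B": 2, "C": 2, "D": 2}
index, d = halve_index(index, local_depth, 3)
print(index, d)
```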
[Figure: delete with merging. Before: index with d = 3 and entries 000-111, with some pages of local depth 3. After merging every page has l = 2, so l < d everywhere and the index is halved to d = 2 with entries 00, 01, 10, 11]
Observations
 A page splits and there are more than b keys with same next bit
 take one more bit for addressing (increase l)
 if d=l the index doubles again !!
 Hashing might fail for non-uniform distributions of keys (e.g., multiple keys with same value)
 if distribution is known, transform it to uniform
 Dynamic hashing performs better for non-uniform distributions (affected locally)
Performance
 For n records and page size b
 expected size of index (Flajolet): $\frac{e}{b \ln 2}\, n^{1 + \frac{1}{b}} \approx \frac{3.92}{b}\, n^{1 + \frac{1}{b}}$
 1 disk access/retrieval when index in
main memory
 2 disk accesses when index is on disk
 overflows increase number of disk
accesses
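To get a feel for how the index grows, the estimate can be evaluated numerically. The sketch below assumes the approximation (3.92 / b) * n^(1 + 1/b) for the expected number of index entries; the function name `expected_index_size` and the sample values of n and b are chosen for illustration.

```python
def expected_index_size(n: int, b: int) -> float:
    """Approximate expected number of index entries: (3.92 / b) * n**(1 + 1/b)."""
    return (3.92 / b) * n ** (1 + 1 / b)


for b in (10, 50, 100):
    print(b, round(expected_index_size(1_000_000, b)))
```

The exponent 1 + 1/b means the index grows slightly faster than linearly in n, and larger pages reduce both the constant and the exponent.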
Storage Utilization with Page Splitting
[Figure: a full page of b records before splitting, and two half-full pages after splitting]
 $u = \frac{b}{2b} = 50\%$ after splitting
 In general 50% < u < 100%
 On the average u ~ ln2 ~ 69% (no overflows)
Storage Utilization with Overflows
 Achieves higher u and avoids index doubling (d=l)
[Figure: a data page of size b with an overflow page]
 higher u is achieved for small overflow pages
 u=2b/3b~66% after splitting
 small overflow pages (e.g., b/2) => u = (b+b/2)/2b ~ 75%
 double index only if the overflow overflows!!
Linear Hashing (Litwin 1980)
 Dynamic scheme without index
 Indices refer to page addresses
 Overflows are allowed
 The file grows one page at a time
 The page which splits is not always
the one which overflowed
 The pages split in a predetermined
order
Linear Hashing (cont.)
 Initially n empty pages
 p points to the page that splits
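[Figure: pages of size b, with pointer p marking the page that splits]
 Overflows are allowed
[Figure: the same file with overflow pages attached to some pages]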
File Growing
 A page splits whenever the “splitting
criterion” is satisfied
 a page is added at the end of the file
 pointer p points to the next page
 split contents of old page between old
and new page based on key values
[Figure: example file of 5 pages, numbered 0-4; p points to page 0]
page 0: 125, 320, 90, 435; overflow: 215
page 1: 16, 711
page 2: 402, 27, 737, 712; overflow: 522
page 3: 613, 303, 438 (new element)
page 4: 4, 319
 u = 17/22, compared with the 80% threshold => split
 b = b_page = 4, b_overflow = 1
 initially n = 5 pages
 hash function h0 = k mod 5
 splitting criterion u > A%
 alternatively split when overflow overflows, etc.
[Figure: the file after the split; p points to page 1]
page 0 (h1): 320, 90
page 1 (h0): 16, 711
page 2 (h0): 402, 27, 737, 712; overflow: 522
page 3 (h0): 613, 303, 438
page 4 (h0): 4, 319
page 5 (h1): 125, 435, 215
 u = 18/25, compared with the 80% threshold
 Page 5 is added at end of file
 The contents of page 0 are split between pages 0 and 5 based on hash function h1 = key mod 10
 p points to the next page
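The split in this example can be reproduced with a few lines of Python. This is a simplified sketch: overflow records are kept in the same list as the page's own records, and the function name `split_page` is an assumption made for the illustration.

```python
def split_page(pages, p, n):
    """Split page p of a file that started with n pages (first round of splits).

    pages maps a page number to the list of keys stored there; overflow
    records are kept in the same list for simplicity.
    """
    new_page = p + n                   # page appended at the end of the file
    old_keys = pages[p]
    pages[p] = [k for k in old_keys if k % (2 * n) == p]
    pages[new_page] = [k for k in old_keys if k % (2 * n) == new_page]
    return p + 1                       # p now points to the next page


keys = [125, 320, 90, 435, 16, 402, 711, 27, 737, 712,
        215, 522, 613, 303, 4, 319, 438]
n = 5
pages = {i: [] for i in range(n)}
for k in keys:
    pages[k % n].append(k)             # h0 = k mod 5

p = split_page(pages, p=0, n=n)        # h1 = k mod 10 redistributes page 0
print(pages[0], pages[5], p)
```

Only page 0 is touched: every key that hashed to 0 under h0 hashes to either 0 or 5 under h1, so no other page needs to be read.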
Hash Functions
 Initially h0=key mod n
 As new pages are added at end of file, h0
alone becomes insufficient
 The file will eventually double its size
 In that case use h1=key mod 2n
 In the meantime
 use h0 for pages not yet split
 use h1 for pages that have already split
 Split contents of page pointed to by p based on h1
Hash Functions (cont.)
 When the file has doubled its size, h0
is no longer needed
 set h0=h1 and continue (e.g., h0=k mod 10)
 The file will eventually double its size
again
 Deletions cause merging of pages
whenever a merging criterion is
satisfied (e.g., u < B%)
Hash Functions
 Initially n pages and 0 <= h0(k) <= n-1
 Series of hash functions
$$h_{i+1}(k) = \begin{cases} h_i(k) \\ h_i(k) + n 2^i \end{cases}$$
 Selection of hash function:
if h_i(k) >= p then use h_i(k)
else use h_{i+1}(k)
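The selection rule translates directly into an address function. The sketch below assumes h_i(k) = k mod (n * 2^i), which agrees with h0 = k mod n and h1 = k mod 2n; the function name `address` and its parameter names are hypothetical.

```python
def address(key: int, n: int, i: int, p: int) -> int:
    """Page address of key, where p is the next page to split in round i."""
    a = key % (n * 2 ** i)             # h_i(k)
    if a >= p:
        return a                       # page not yet split in this round
    return key % (n * 2 ** (i + 1))    # h_{i+1}(k) for pages already split


# n = 5 initial pages, page 0 already split, so p = 1.
for k in (320, 125, 16, 438):
    print(k, address(k, n=5, i=0, p=1))
```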
Linear Hashing with Partial Expansions
(Larson 1980)
 Problem with Linear Hashing: pages to the right of p are split late
 large chains of overflows on rightmost pages
 Solution: do not wait that long to split a page
 k partial expansions: take pages in groups of k
 all k pages of a group split together
 the file grows at lower rates
Two Partial Expansions
 Initially 2n pages, n groups, 2 pages/group
 groups: (0, n) (1, 1+n)…(i, i+n) … (n-1, 2n-1)
[Figure: pages 0, 1, ..., n, ..., 2n; 2 pointers to pages of the same group]
 Pages in same group split together => some records go to a new page at end of file (position: 2n)
1st Expansion
 After n splits, all pages are split
 the file has 3n pages (1.5 times larger)
 the file grows at lower rate
[Figure: file with page positions 0, n, 2n, 3n]
 after 1st expansion take pages in groups of 3 pages: (j, j+n, j+2n), 0 <= j <= n-1
[Figure: the 3n-page file, positions 0, n, 2n, 3n]
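The grouping of pages can be written out explicitly. The helper below is a small sketch whose name `groups` and whose sample value n = 3 are assumptions; it lists the groups used before and after the 1st expansion.

```python
def groups(n: int, pages_per_group: int):
    """Return the groups (j, j+n, j+2n, ...) for j = 0 .. n-1."""
    return [tuple(j + g * n for g in range(pages_per_group)) for j in range(n)]


n = 3
print(groups(n, 2))   # groups of 2 pages while the file grows from 2n to 3n
print(groups(n, 3))   # groups of 3 pages while the file grows from 3n to 4n
```

Each split adds one page for a whole group, so the file grows by a factor of 1.5 and then 4/3 instead of doubling in a single expansion.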
2nd Expansion
 After n splits the file has size 4n
 repeat the same process having initially 4n pages in 2n groups
[Figure: file with positions 0, 1, ..., 2n, ..., 4n; 2 pointers to pages of the same group]
[Figure: disk accesses per retrieval (vertical axis, 1 to 1.6) against relative file size (horizontal axis, 1 to 2), with one curve for Linear Hashing and one for Linear Hashing with 2 partial expansions]

Disk accesses per operation (b = 5, b' = 5, u = 0.85):

             Linear Hashing   Linear Hashing   Linear Hashing
                              2 part. exp.     3 part. exp.
retrieval    1.17             1.12             1.09
insertion    3.57             3.21             3.31
deletion     4.04             3.53             3.56
Dynamic Hashing Schemes
 Very good performance on membership, insert, delete operations
 Suitable for both main memory and disk
 b=1-3 records for main memory
 b=1-4 Kbytes for disk
 Critical parameter: space utilization u
 large u => more overflows, bad performance
 small u => fewer overflows, better performance
 Suitable for direct access queries (random accesses) but not for range queries