Transcript Hashing
Hashing on the Disk
Keys are stored in “disk pages”
(“buckets”)
several records fit within one page
Retrieval:
find address of page
bring page into main memory
searching within the page comes for
free
E.G.M. Petrakis
Hashing
1
[Figure: a hash function maps the key space Σ to data pages 0, 1, 2, ..., m-1; each page holds up to b records]
page size b: maximum number of records in page
space utilization u: measure of the use of space
$u = \frac{\#\,\text{stored records}}{\#\,\text{pages} \cdot b}$
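As a quick illustration of the definition of u, the following minimal sketch computes the space utilization of a file. The function name and the sample numbers are assumptions made for this example; they do not come from the slides.

```python
def space_utilization(stored_records: int, pages: int, b: int) -> float:
    """u = (# stored records) / (# pages * b), with b the page size in records."""
    if pages <= 0 or b <= 0:
        raise ValueError("pages and b must be positive")
    return stored_records / (pages * b)


# Hypothetical file: 5 pages of b = 4 records each, holding 14 records.
u = space_utilization(stored_records=14, pages=5, b=4)
print(f"u = {u:.0%}")
```

A value of u close to 1 means little wasted space, but it also means that most pages are nearly full.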
Collisions
Keys that hash to the same address
are stored within the same page
If the page is full:
i. page splits: allocate a new page and split page content between the old and the new page, or
ii. overflows: list of overflow pages
[Figure: a full page (xxxx) linked to an overflow page (xx)]
Access Time
Goal: find key in one disk access
Access time ~ number of accesses
Large u: good space utilization but
many overflows or splits => more disk
accesses
Non-uniform key distribution => many
keys map to the same addresses =>
overflows or splits => more accesses
Categories of Methods
Static: require file reorganization
open addressing, separate chaining
Dynamic: dynamic file growth, adapt
to file size
dynamic hashing,
extendible hashing,
linear hashing,
spiral storage…
Dynamic Hashing Schemes
File size adapts to data size without
total reorganization
Typically 1-3 disk accesses to access
a key
Access time and u are a typical tradeoff
u between 50-100% (typically 69%)
Complicated implementation
Schemes With Index
Two disk accesses:
one to access the index, one to access the data
with index in main memory => one disk access
Problem: the index may become too large
[Figure: an index whose entries point to the data pages]
Dynamic hashing (Larson 1978)
Extendible hashing (Fagin et al. 1979)
Schemes Without Index
Ideally, less space and fewer disk accesses
(at least one)
[Figure: the address space maps directly onto the data space, with no index in between]
Linear Hashing (Litwin 1980)
Linear Hashing with Partial Expansions
(Larson 1980)
Spiral Storage (Martin 1979)
Hash Functions
Support for shrinking or growing file
shrinking or growing address space, the
hash function adapts to these changes
hash functions using the first (last) bits of
key = $b_{n-1} b_{n-2} \dots b_i b_{i-1} \dots b_2 b_1 b_0$
$h_i(key) = b_{i-1} \dots b_2 b_1 b_0$ supports $2^i$ addresses
$h_i$: one more bit than $h_{i-1}$ to address
larger files
$h_i(key) = h_{i-1}(key)$ or $h_i(key) = h_{i-1}(key) + 2^{i-1}$
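A minimal sketch of this family of hash functions, assuming integer keys and taking the last i bits of the key. The helper name h and the sample key are hypothetical.

```python
def h(i: int, key: int) -> int:
    """h_i(key): the i least significant bits of key, i.e. key mod 2**i."""
    return key & ((1 << i) - 1)


key = 0b101101  # hypothetical key (45 in decimal)
for i in range(1, 5):
    prev, cur = h(i - 1, key), h(i, key)
    # Using one more bit either keeps the address or adds 2**(i-1) to it.
    assert cur in (prev, prev + (1 << (i - 1)))
    print(i, format(cur, f"0{i}b"))
```

Because an address changes by either 0 or 2**(i-1) when one more bit is used, growing the address space only ever moves records from an old page to one specific new page.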
Dynamic Hashing (Larson 1978)
Two level index
primary h1(key): accesses a hash table
secondary h2(key): accesses a binary tree
[Figure: the 1st level, h1(k), selects a hash table entry; the 2nd level, h2(k), walks a binary tree index down to data pages of size b]
Index
Fixed (static): h1(key) = key mod m
Dynamic behavior on secondary index
h2(key) uses i bits of key
the bit sequence of $h_2 = b_{i-1} \dots b_2 b_1 b_0$
denotes which path on the binary tree
index to follow in order to access the
data page
scan h2 from right to left (bit 1: follow
right path, bit 0: follow left path)
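The two-level lookup can be sketched as follows. This is an illustration under stated assumptions, not the slides' implementation: pages are plain Python lists, the names Node and find_page are hypothetical, and h2 takes its bits from key // m so that they are independent of h1(key) = key mod m.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Page = List[int]


@dataclass
class Node:
    left: "Tree"    # followed when the current h2 bit is 0
    right: "Tree"   # followed when the current h2 bit is 1


Tree = Union[Node, Page]


def find_page(index: List[Optional[Tree]], key: int) -> Optional[Page]:
    m = len(index)
    tree = index[key % m]      # 1st level: h1(key) = key mod m
    bits = key // m            # assumption: h2 takes its bits from key // m
    while isinstance(tree, Node):
        tree = tree.right if bits & 1 else tree.left   # rightmost bit first
        bits >>= 1
    return tree                # a data page, or None for an empty entry


# Hypothetical index with m = 6: entry 1 has a tree of depth 2, entry 5 a single page.
index: List[Optional[Tree]] = [None] * 6
index[1] = Node(left=[1, 13], right=Node(left=[7], right=[19]))
index[5] = [5, 11]
print(find_page(index, 7), find_page(index, 19), find_page(index, 11))
```

Key 7 has h1 = 1 and h2 bits "01" (read right to left: right, then left), while key 11 reaches its page without consuming any h2 bits.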
[Figure: 1st level index with entries 0-5 addressed by h1(k); 2nd level binary trees addressed by h2(k) lead to data pages 1-4 of size b. Page addresses shown: h1=1, h2=“0”; h1=1, h2=“01”; h1=1, h2=“11”; h1=5, h2=any]
h1(key) = key mod 6
h2(key) = “01”<= depth of binary tree = 2
Insertions
Initially a fixed-size primary index and
no data
[Figure: primary index with entries 0-3; the first data page hangs under entry 1 (h1=1, h2=any)]
insert record in new page under its h1 address
if page is full, allocate one extra page
split keys between old and new page
use one extra bit in h2 for addressing
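The insertion steps above can be sketched as follows. This is a simplified in-memory illustration, not the original implementation: the names Node, h2_bit and insert are hypothetical, the page size B = 2 is chosen only to force splits, keys are assumed to be distinct non-negative integers, and h2 is assumed to take its bits from key // m.

```python
from typing import List, Optional

B = 2  # page size b; a small hypothetical value to force splits


class Node:
    """A leaf holds a page of keys; after a split it keeps only two children."""

    def __init__(self) -> None:
        self.page: Optional[List[int]] = []
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def h2_bit(key: int, m: int, depth: int) -> int:
    # Assumption: h2 takes its bits from key // m, rightmost bit first.
    return ((key // m) >> depth) & 1


def insert(index: List[Optional[Node]], key: int) -> None:
    m = len(index)
    slot = key % m                       # h1(key) = key mod m
    if index[slot] is None:
        index[slot] = Node()             # new page under the h1 address
    node, depth = index[slot], 0
    while node.page is None:             # follow h2 down to a data page
        node = node.right if h2_bit(key, m, depth) else node.left
        depth += 1
    if key in node.page:
        return                           # assumption: keys are distinct
    node.page.append(key)
    while len(node.page) > B:            # overflow: split on one extra h2 bit
        keys, node.page = node.page, None
        node.left, node.right = Node(), Node()
        for k in keys:
            child = node.right if h2_bit(k, m, depth) else node.left
            child.page.append(k)
        node = node.left if len(node.left.page) > B else node.right
        depth += 1


index: List[Optional[Node]] = [None] * 4
for key in (1, 5, 9, 17):
    insert(index, key)
```

All four sample keys have h1 = 1, so only the tree under entry 1 grows; the other index entries stay empty, which is the "affected locally" behavior of the scheme.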
[Figure: primary index with entries 0-3; after the split, entry 1 leads through a tree node to two pages: h1=1, h2=0 and h1=1, h2=1]
[Figure: worked insertion example with index entries 0-3 and data pages 1-5 of size b. Pages start as (h1=0, h2=any) and (h1=3, h2=any); the page under entry 0 splits into (h1=0, h2=0) and (h1=0, h2=1); finally the pages are (h1=0, h2=0), (h1=0, h2=01), (h1=0, h2=11), (h1=3, h2=0) and (h1=3, h2=1)]
Deletions
Find record to be deleted using h1, h2
Delete record
Check “sibling” page:
fewer than b records in both pages together?
if yes merge the two pages
delete one empty page
shrink binary tree index by one level and
reduce h2 by one bit
[Figure: index with entries 0-3 before and after a deletion; after the delete, two sibling pages merge and the binary tree shrinks by one level]
Extendible Hashing (Fagin et al. 1979)
Dynamic hashing without the primary (first-level) index
Primary hashing is omitted
Only secondary hashing with all binary
trees at the same level
The index shrinks and grows
according to file size
Data pages attached to the index
[Figure: a dynamic hashing index (entries 0-4 with binary trees of different depths) next to dynamic hashing with all binary trees at the same level, i.e. a table with entries 00, 01, 10, 11; each data page is labeled with its number of address bits]
Insertions
Initially 1 index and 1 data page
0 address bits
insert records in data page
global depth d: size of index is $2^d$
local depth l: number of address bits
[Figure: index with d = 0 pointing to a single data page of capacity b with l = 0]
Page “0” Overflows
[Figure: index and storage before and after the overflow. Before: d = 0 and one page of capacity b with l = 0. After: d: global depth = 1, index entries 0 and 1, two pages with l: local depth = 1]
Page “0” Overflows (cont.)
1 more key bit for addressing and 1 extra
page => index doubles !!
Split contents of previous page between 2
pages according to next bit of key
Global depth d: number of index bits => index size $2^d$
Local depth l : number of bits for record
addressing
Page “0” Overflows (again)
[Figure: index with entries 00, 01, 10, 11 (d = 2). Two pages with l = 2 contain records with the same 2 bits of key; one page with l = 1 contains records with the same 1st bit of key]
Page “01” Overflows
[Figure: index with entries 000-111 (d = 3); data pages with local depths l = 2, 3, 3, 1]
1 more key bit for addressing
$2^{d-l}$: number of pointers to page
Page “100” Overflows
[Figure: index with entries 000-111 (d = 3); data pages with local depths l = 2, 3, 3, 2, 2 after the split (+1 page)]
no need to double index
page 100 splits into two (1 new page)
local depth l is increased by 1
Insertion Algorithm
If l < d, split the overflowed page (1 extra page)
If l = d => the index is doubled, then the page is split
d is increased by 1 => 1 more bit for addressing
update pointers (either way):
a) if d prefix bits are used for addressing
d = d + 1;
for (i = (1 << d) - 1; i >= 0; i--) index[i] = index[i / 2];
b) if d suffix bits are used
for (i = 0; i <= (1 << d) - 1; i++) index[i + (1 << d)] = index[i];
d = d + 1;
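The insertion algorithm can be sketched in a few lines. This is an in-memory illustration under stated assumptions rather than a definitive implementation: the class and method names are hypothetical, the key itself stands in for its hash value, keys are distinct integers, and the d suffix bits are used for addressing (case b).

```python
from typing import List


class Page:
    def __init__(self, local_depth: int) -> None:
        self.local_depth = local_depth        # l
        self.keys: List[int] = []


class ExtendibleHash:
    def __init__(self, b: int) -> None:
        self.b = b                            # page size
        self.d = 0                            # global depth
        self.index: List[Page] = [Page(0)]    # 2**d pointers

    def _slot(self, key: int) -> int:
        return key & ((1 << self.d) - 1)      # d suffix bits of the key

    def find(self, key: int) -> bool:
        return key in self.index[self._slot(key)].keys

    def insert(self, key: int) -> None:
        page = self.index[self._slot(key)]
        if key in page.keys:
            return
        while len(page.keys) >= self.b:       # page is full
            if page.local_depth == self.d:    # l = d: double the index
                self.index = self.index + self.index
                self.d += 1
            self._split(page)                 # now l < d: one extra page
            page = self.index[self._slot(key)]
        page.keys.append(key)

    def _split(self, page: Page) -> None:
        bit = 1 << page.local_depth           # next key bit used for addressing
        page.local_depth += 1
        new = Page(page.local_depth)
        old_keys, page.keys = page.keys, []
        for k in old_keys:
            (new if k & bit else page).keys.append(k)
        for i, p in enumerate(self.index):    # redirect half of the pointers
            if p is page and i & bit:
                self.index[i] = new


t = ExtendibleHash(b=2)
for key in (0, 4, 1, 8):
    t.insert(key)
print(t.d, len(t.index))
```

With these sample keys the index doubles three times, because 0, 4 and 8 agree on their last two bits. Each page ends up with 2**(d-l) pointers, for example four pointers to the page that holds key 1 (l = 1, d = 3).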
Deletion Algorithm
Find and delete record
Check sibling page
If fewer than b records in both pages together
merge pages and free empty page
decrease local depth l by 1 (records in
merged page have 1 less common bit)
if l < d everywhere => reduce index (half
size)
update pointers
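The last step, reducing the index once l < d holds everywhere, can be sketched as below. The function name and the dictionary-based page records are assumptions for this illustration, and suffix-bit addressing is assumed, so that the upper half of the index repeats the lower half.

```python
from typing import List, Tuple


def halve_index(index: List[dict], d: int) -> Tuple[List[dict], int]:
    """Halve a suffix-bit index for as long as every page has local depth l < d."""
    while d > 0 and all(page["l"] < d for page in index):
        index = index[: len(index) // 2]   # the upper half repeats the lower half
        d -= 1
    return index, d


# Hypothetical state after a merge: four pages, all with l = 2, under an index with d = 3.
A, B, C, D = ({"l": 2, "keys": [k]} for k in (8, 5, 2, 7))
index, d = halve_index([A, B, C, D, A, B, C, D], 3)
print(d, len(index))
```

The index shrinks from 8 to 4 entries and stops there, since the pages now have l = d.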
[Figure: delete with merging. The index with entries 000-111 (d = 3) is shown before and after the merge; once l < d for every page (all l = 2), the index is halved to entries 00, 01, 10, 11 (d = 2)]
Observations
A page splits and there are more than b
keys with same next bit
take one more bit for addressing (increase l)
if d=l the index doubles again !!
Hashing might fail for non-uniform
distributions of keys (e.g., multiple keys
with same value)
if distribution is known, transform it to uniform
Dynamic hashing performs better for nonuniform distributions (affected locally)
Performance
For n records and page size b
expected size of index (Flajolet):
$\frac{e}{b \ln 2}\, n^{1 + 1/b} \approx \frac{3.92}{b}\, n^{1 + 1/b}$
1 disk access/retrieval when index in
main memory
2 disk accesses when index is on disk
overflows increase number of disk
accesses
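To get a feel for how quickly the index grows, the estimate can be evaluated numerically. The sketch below assumes the commonly quoted form of the Flajolet estimate, (3.92 / b) n^(1 + 1/b), using the constant 3.92 from this slide; the function name and the sample values of n and b are hypothetical.

```python
def expected_index_size(n: int, b: int) -> float:
    """Approximate expected number of index entries: (3.92 / b) * n**(1 + 1/b)."""
    return 3.92 / b * n ** (1 + 1 / b)


for b in (5, 10, 50):
    print(b, round(expected_index_size(1_000_000, b)))
```

The exponent 1 + 1/b is larger than 1, so the index grows faster than the file itself; larger pages bring the growth closer to linear.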
Storage Utilization with Page
Splitting
[Figure: one full page of b records before splitting; two pages of capacity b after splitting]
$u = \frac{b}{2b} = 50\%$ after splitting
In general 50% < u < 100%
On the average u ~ ln2 ~ 69% (no overflows)
Storage Utilization with
Overflows
Achieves higher u and avoids page doubling (d=l)
[Figure: a data page of size b with an overflow page of size b]
higher u is achieved for small overflow pages
u=2b/3b~66% after splitting
small overflow pages (e.g., b/2) => u = (b+b/2)/2b ~ 75%
double index only if the overflow overflows!!
Linear Hashing (Litwin 1980)
Dynamic scheme without index
Indices refer to page addresses
Overflows are allowed
The file grows one page at a time
The page which splits is not always
the one which overflowed
The pages split in a predetermined
order
Linear Hashing (cont.)
Initially n empty pages
p points to the page that splits
[Figure: n empty pages of size b; p points to the first page]
Overflows are allowed
[Figure: pages of size b with overflow chains; p marks the next page to split]
File Growing
A page splits whenever the “splitting
criterion” is satisfied
a page is added at the end of the file
pointer p points to the next page
split contents of old page between old
and new page based on key values
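[Figure: file before the split; p points to page 0]
page 0: 125, 320, 90, 435; overflow: 215
page 1: 16, 711
page 2: 402, 27, 737, 712; overflow: 522
page 3: 613, 303; new element: 438
page 4: 4, 319
u = 17/22 before the insertion; with the new element, u = 18/22 > 80% => split
b = b_page = 4, b_overflow = 1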
initially n=5 pages
hash function h0=k mod 5
splitting criterion u > A%
alternatively split when overflow overflows,
etc.
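[Figure: file after the split; p points to page 1]
page 0 (h1): 320, 90
page 1 (h0): 16, 711
page 2 (h0): 402, 27, 737, 712; overflow: 522
page 3 (h0): 613, 303, 438
page 4 (h0): 4, 319
page 5 (h1): 125, 435, 215
u = 18/25 < 80%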
Page 5 is added at end of file
The contents of page 0 are split between
pages 0 and 5 based on hash function h1 =
key mod 10
p points to the next page
Hash Functions
Initially h0=key mod n
As new pages are added at end of file, h0
alone becomes insufficient
The file will eventually double its size
In that case use h1=key mod 2n
In the meantime
use h0 for pages not yet split
use h1 for pages that have already split
Split contents of page pointed to by p based on h1
Hash Functions (cont.)
When the file has doubled its size, h0
is no longer needed
set h0=h1 and continue (e.g., h0=k mod 10)
The file will eventually double its size
again
Deletions cause merging of pages
whenever a merging criterion is
satisfied (e.g., u < B%)
Hash Functions
Initially n pages and 0 <= h0(k) <= n-1
Series of hash functions
$h_{i+1}(k) = h_i(k)$ or $h_{i+1}(k) = h_i(k) + n 2^i$
Selection of hash function:
if hi(k) >= p then use hi(k)
else use hi+1(k)
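The selection rule translates directly into an address function. The sketch below assumes h_i(k) = k mod (n 2^i), in line with h0 = key mod n and h1 = key mod 2n; the function name address is hypothetical.

```python
def address(key: int, n: int, i: int, p: int) -> int:
    """Page address for a file with n initial pages, i completed doublings and split pointer p."""
    a = key % (n * 2 ** i)               # h_i(k)
    if a >= p:
        return a                         # this page has not been split yet
    return key % (n * 2 ** (i + 1))      # h_{i+1}(k) for pages that already split


# State after the first split of the example: n = 5, i = 0, p = 1.
for key in (125, 320, 438):
    print(key, address(key, n=5, i=0, p=1))
```

Only keys whose h_i address lies to the left of p need the second hash function, so a lookup costs at most two modulo operations and no index access.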
Linear Hashing with Partial Expansions
(Larson 1980)
Problem with Linear Hashing: pages to the
right of p are split late
long chains of overflows on rightmost pages
Solution: do not wait that long to split a
page
k partial expansions: take pages in groups of k
all k pages of a group split together
the file grows at lower rates
Two Partial Expansions
Initially 2n pages, n groups, 2 pages/group
groups: (0, n) (1, 1+n)…(i, i+n) … (n-1, 2n-1)
[Figure: pages 0 to 2n-1; 2 pointers to the pages of the same group]
Pages in the same group split together => some
records go to a new page at the end of the file
(position: 2n)
1st Expansion
After n splits, all pages are split
the file has 3n pages (1.5 times larger)
the file grows at a lower rate
[Figure: file with pages 0 to 3n-1, with boundaries at n, 2n and 3n]
after 1st expansion take pages in groups
of 3 pages: (j, j+n, j+2n), 0 <= j <= n-1
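The grouping used by the two partial expansions can be written out explicitly. This is a small sketch with a hypothetical helper name and a hypothetical value of n; it only enumerates the groups and does not move any records.

```python
from typing import List, Tuple


def groups(n: int, pages_per_group: int) -> List[Tuple[int, ...]]:
    """Group j contains pages j, j+n, j+2n, ... for 0 <= j <= n-1."""
    return [tuple(j + g * n for g in range(pages_per_group)) for j in range(n)]


n = 3                               # hypothetical number of groups
first = groups(n, 2)                # 1st expansion: 2n pages grow to 3n
second = groups(n, 3)               # 2nd expansion: 3n pages grow to 4n
print(first)
print(second)
```

Each split spreads the records of one group over one additional page, so the file grows from 2n to 3n pages during the first expansion and from 3n to 4n during the second.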
2nd Expansion
After n splits the file has size 4n
repeat the same process having initially
4n pages in 2n groups
[Figure: pages 0 to 4n-1 with the boundary at 2n; 2 pointers to the pages of the same group]
[Figure: disk accesses per retrieval (1 to 1.6) versus relative file size (1 to 2), for Linear Hashing and for Linear Hashing with 2 partial expansions]
Disk accesses per operation (b = 5, b’ = 5, u = 0.85):
            Linear Hashing   Linear Hashing    Linear Hashing
                             2 part. exp.      3 part. exp.
retrieval   1.17             1.12              1.09
insertion   3.57             3.21              3.31
deletion    4.04             3.53              3.56
Dynamic Hashing Schemes
Very good performance on membership,
insert, delete operations
Suitable for both main memory and disk
b=1-3 records for main memory
b=1-4 Kbytes for disk
Critical parameter: space utilization u
large u => more overflows, bad performance
small u => fewer overflows, better performance
Suitable for direct access queries (random
accesses) but not for range queries