Hashing - 弘光科技大學

Chapter 8
Symbol Table
• Symbol tables are used widely in many applications.
  • A dictionary is a kind of symbol table.
  • A data dictionary in database management is another.
• In general, the following operations are performed on a symbol table:
  • determine if a particular name is in the table
  • retrieve the attributes of that name
  • modify the attributes of that name
  • insert a new name and its attributes
  • delete a name and its attributes

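The five operations listed above can be sketched with Python's built-in dict, which is itself a hash table. The names `symbols`, `count`, and `ratio` and their attributes are made up for illustration.

```python
# A symbol table sketched with a dict; names and attributes are illustrative only.
symbols = {}

symbols["count"] = {"type": "int", "scope": "global"}   # insert a name and its attributes
symbols["ratio"] = {"type": "float", "scope": "local"}

is_present = "count" in symbols          # determine if a name is in the table
attrs = symbols["count"]                 # retrieve the attributes of that name
symbols["count"]["scope"] = "local"      # modify the attributes of that name
del symbols["ratio"]                     # delete a name and its attributes
```

Each of these operations takes expected constant time, which is the behavior the rest of the chapter sets out to explain.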
Symbol Table
• Popular operations on a symbol table include search, insertion, and deletion.
• A binary search tree could be used to represent a symbol table.
  • The worst-case complexity of these operations is O(n).
• Hashing: insertions, deletions, and finds in O(1) expected time.
  • Static hashing
  • Dynamic hashing

Static Hashing
• Dictionary pairs are stored in a fixed-size table called a hash table.
• The address or location of a dictionary pair with key x is obtained by computing some arithmetic function h(x).
• The memory available to maintain the symbol table (hash table) is assumed to be sequential.
• The hash table consists of b buckets, and each bucket can hold s records.
• h(x) maps the set of possible keys onto the integers 0 through b-1.

Hash Tables
• The key density of a hash table is the ratio n/T, where n is the number of dictionary pairs in the table and T is the total number of possible keys.
• The loading density or loading factor of a hash table is α = n/(sb).

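As a worked instance of the loading density, take the figures used in Example 8.1 of this chapter (n = 10 keys, b = 26 buckets, s = 2 slots per bucket):

```latex
\alpha = \frac{n}{s\,b} = \frac{10}{2 \cdot 26} = \frac{10}{52} \approx 0.19
```

So fewer than one slot in five would be occupied, even though some individual buckets may still overflow.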
Hash Tables
• Two keys, I1 and I2, are said to be synonyms with respect to h if h(I1) = h(I2).
• An overflow occurs when a new key i is mapped or hashed by h into a full bucket.
• A collision occurs when two non-identical keys are hashed into the same bucket.
• If the bucket size is 1, collisions and overflows occur at the same time.

Example 8.1
• b = 26, s = 2
• Assume there are 10 distinct keys: GA, D, A, G, L, A2, A1, A3, A4, and E.
• If no overflow occurs, the time required for hashing depends only on the time required to compute the hash function h.
• What happens when A1 is entered?

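The difference between a collision and an overflow in this example can be traced with a short simulation. This is a sketch under two assumptions: h(k) is the position of the key's first letter (as in Example 8.6), and the key list includes D, as in the key list of Example 8.6. The function name `load` is hypothetical.

```python
# Sketch of Example 8.1: b = 26 buckets, s = 2 slots per bucket.
# Assumption: h(k) is the position of the key's first letter (A -> 0, ..., Z -> 25).
B, S = 26, 2

def h(key: str) -> int:
    return ord(key[0].upper()) - ord("A")

def load(keys):
    """Place keys into buckets; return (table, collisions, overflows)."""
    table = [[] for _ in range(B)]
    collisions, overflows = [], []
    for key in keys:
        bucket = table[h(key)]
        if bucket:                      # bucket already holds a different key
            collisions.append(key)
        if len(bucket) < S:
            bucket.append(key)
        else:                           # bucket is full: overflow
            overflows.append(key)
    return table, collisions, overflows

keys = ["GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "E"]
table, collisions, overflows = load(keys)
print(table[0], table[6])   # buckets for 'A' and 'G'
print(collisions)           # keys that hashed to a non-empty bucket
print(overflows)            # keys that hashed to a full bucket
```

Under these assumptions G and A2 collide without overflowing, while A1 is the first key to find its bucket full.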
Hash Functions
• Requirements
  • Simple to compute
  • Minimize the number of collisions
  • Dependent upon all the characters in the keys
• Uniform hash function
  • If there are b buckets, we want h(x) = i with probability 1/b for each bucket i.
• Common hash functions: mid-square, division, folding, digit analysis

Division
• Uses the modulo (%) operator.
• A key k is divided by some number D, and the remainder is used as the hash address for k: h(k) = k % D.
• The bucket addresses are in the range 0 through D-1.
• If D is a power of 2, then h(k) depends only on the least significant bits of k.

Division
• If division is used as the hash function, the table size should not be a power of two.
  • Programmers tend to use many variables with the same suffix, which leads to too many collisions.
• If D is divisible by two, odd keys are mapped to odd buckets and even keys are mapped to even buckets. Thus, the hash table is biased.

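Both biases described above are easy to observe. The sketch below uses a hypothetical helper `h_div` and an arbitrary set of integer keys that share their low four bits. The divisor 13 is chosen only for illustration.

```python
def h_div(k: int, d: int) -> int:
    """Division hash: the remainder of k divided by d."""
    return k % d

# Keys that share their low 4 bits (an arbitrary example set).
keys = [0x1A3, 0x2B3, 0x7C3, 0xFD3]

# D = 16 is a power of two: only the 4 least significant bits matter.
print([h_div(k, 16) for k in keys])

# D = 13 (an odd prime, chosen here only for illustration) depends on all bits.
print([h_div(k, 13) for k in keys])

# Even D: odd keys go to odd buckets, even keys go to even buckets.
print(all(h_div(k, 10) % 2 == k % 2 for k in range(1000)))
```

With D = 16 all four keys fall into the same bucket, whereas D = 13 spreads them over four different buckets.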
Mid-Square
• Mid-square function
  • It is computed by squaring the key and then using an appropriate number of bits from the middle of the square to obtain the bucket address.
  • If r bits are used, the table size is 2^r, a power of two.

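A minimal sketch of the mid-square idea follows. The function name `h_mid_square` and the `key_bits` parameter (the assumed width of a key, which fixes where the "middle" of the square lies) are not from the slides.

```python
def h_mid_square(k: int, r: int, key_bits: int = 16) -> int:
    """Mid-square hash: square k, then take r bits from the middle of the square.

    key_bits is the assumed width of a key, so the square has 2 * key_bits bits
    and the table has 2**r buckets.
    """
    square = k * k
    shift = (2 * key_bits - r) // 2      # bits dropped below the middle window
    return (square >> shift) & ((1 << r) - 1)

# Example: a 16-bit key hashed into a table of 2**8 = 256 buckets.
print(h_mid_square(1234, 8))
```

Because the middle bits of the square depend on every bit of the key, keys that differ only in their high or low bits still tend to land in different buckets.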
Folding
• The key k is partitioned into several parts, all but the last being of the same length.
• The partitions are added together to obtain the hash address for k.
• Shift folding: the partitions are added together as they are to get h(k).
• Folding at the boundaries: the key is folded at the partition boundaries, and digits falling into the same position are added together to obtain h(k).
  • This is equivalent to reversing every other partition and then adding.

Example 8.2
• k = 12320324111220 is partitioned into parts that are three decimal digits long.
• P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20.
• Shift folding:
  • $h(k) = \sum_{i=1}^{5} P_i = 123 + 203 + 241 + 112 + 20 = 699$
• Folding at the boundaries (reverse P2 and P4):
  • h(k) = 123 + 302 + 241 + 211 + 20 = 897

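Both folding variants from Example 8.2 can be reproduced in a few lines. The helper names `partition`, `shift_fold`, and `boundary_fold` are hypothetical, and the key is handled as a string of decimal digits.

```python
def partition(digits: str, width: int) -> list:
    """Split a digit string into parts of `width` digits; the last may be shorter."""
    return [digits[i:i + width] for i in range(0, len(digits), width)]

def shift_fold(digits: str, width: int) -> int:
    """Add the parts as they are."""
    return sum(int(p) for p in partition(digits, width))

def boundary_fold(digits: str, width: int) -> int:
    """Reverse every other part (P2, P4, ...) before adding."""
    parts = partition(digits, width)
    return sum(int(p[::-1]) if i % 2 == 1 else int(p) for i, p in enumerate(parts))

k = "12320324111220"
print(partition(k, 3))
print(shift_fold(k, 3))
print(boundary_fold(k, 3))
```

In practice the sum would still be reduced to the table size, for example with a final modulo, a step the example leaves out.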
Secure Hash Functions
• Can be applied to a message M of any size
• Produces a fixed-length output h
• h = H(M) is easy to compute for any message M
• Given h, it is infeasible to find x such that H(x) = h
  • one-way property
• Given x, it is infeasible to find y ≠ x such that H(y) = H(x)
  • weak collision resistance
• It is infeasible to find any pair x ≠ y such that H(y) = H(x)
  • strong collision resistance

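The first two properties (any message size, fixed-length output) can be observed directly with the SHA-1 implementation in Python's standard `hashlib` module. This is only an illustration of the interface; it says nothing about how hard the three infeasibility properties are to break.

```python
import hashlib

# SHA-1 produces a 160-bit digest for any input length.
for message in (b"", b"abc", b"x" * 1_000_000):
    digest = hashlib.sha1(message).digest()
    print(len(message), len(digest) * 8)

# A one-character change gives an unrelated-looking digest.
print(hashlib.sha1(b"abc").hexdigest())
print(hashlib.sha1(b"abd").hexdigest())
```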
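Secure Hash Algorithm (SHA)
• Step 1: Preprocess the message so that its length is q*512 bits for some integer q. The preprocessing may need to append a string of 0s to the end of the message.
• Step 2: Initialize the 160-bit output buffer OB, which consists of five 32-bit registers A, B, C, D, E, with A = 67452301, B = efcdab89, C = 98badcfe, D = 10325476, E = c3d2e1f0 (all values are in hexadecimal).
• Step 3:
for (int i = 1; i <= q; i++) {
    let Bi = the i-th 512-bit block of the message;
    OB = F(OB, Bi);
}
• Step 4: Output OB.
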
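Steps 1 to 4 can be sketched end to end. The slides leave F unspecified, so the details below follow the published SHA-1 standard rather than this deck: the padding rule (a single 1 bit, then 0s, then the 64-bit message length), the schedule that derives the 80 words W_t from each block, the per-round functions and constants, the final addition into OB, and the initial value A = 67452301. The names `rotl` and `sha1` are chosen for this sketch.

```python
import struct

MASK = 0xFFFFFFFF

def rotl(x: int, k: int) -> int:
    """S^k: circular left shift of a 32-bit value by k bits."""
    return ((x << k) | (x >> (32 - k))) & MASK

def sha1(message: bytes) -> bytes:
    # Step 1: pad to a multiple of 512 bits (a 1 bit, 0 bits, 64-bit length).
    bit_len = len(message) * 8
    message += b"\x80"
    message += b"\x00" * ((56 - len(message) % 64) % 64)
    message += struct.pack(">Q", bit_len)

    # Step 2: initialize the output buffer OB = (A, B, C, D, E).
    ob = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]

    # Step 3: OB = F(OB, B_i) for each 512-bit block B_i.
    for i in range(0, len(message), 64):
        w = list(struct.unpack(">16I", message[i:i + 64]))
        for t in range(16, 80):
            w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = ob
        for t in range(80):
            if t < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif t < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif t < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            a, b, c, d, e = (rotl(a, 5) + f + e + w[t] + k) & MASK, a, rotl(b, 30), c, d
        ob = [(x + y) & MASK for x, y in zip(ob, (a, b, c, d, e))]

    # Step 4: output OB as 20 bytes.
    return struct.pack(">5I", *ob)

print(sha1(b"abc").hex())
```

A sketch like this is useful for study only; production code should call a maintained library implementation.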
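Atomic SHA Operation
• t = step number; 0 ≤ t ≤ 79
• f_t(B, C, D) = elementary bitwise logical function for step t; for example, (B ∧ C) ∨ (¬B ∧ D)
• S^k = circular left shift of a 32-bit register by k bits
• W_t = a 32-bit value derived from B_i
• K_t = a constant
[Figure: one atomic SHA step on registers A, B, C, D, E, built from f_t(B, C, D), the shifts S^5 and S^30, and additions (+) that bring in W_t and K_t.]
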
Example 8.4
• A = 67452301, B = efcdab89, C = 98badcfe, D = 10325476, E = c3d2e1f0
• f_t(B, C, D) = B ⊕ C ⊕ D, K_0 = 5a827999, W_0 = 00000000
• B = 1110 1111 1100 1101 1010 1011 1000 1001
• S^30(B) = 0111 1011 1111 0011 0110 1010 1110 0010 = 7bf36ae2 = new C
• New A = 6e3edeb6
• New B = 67452301, new D = 98badcfe, new E = 10325476

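The numbers in Example 8.4 can be checked with a single atomic step. The sketch assumes the standard SHA-1 update rule (new A = S^5(A) + f_t(B, C, D) + E + W_t + K_t modulo 2^32, then A, S^30(B), C, D shift into B, C, D, E). It also assumes the initial A = 67452301, f_t = B xor C xor D, K_0 = 5a827999, and W_0 = 00000000, which together reproduce the A shown in the example. The name `sha_step` is hypothetical.

```python
MASK = 0xFFFFFFFF

def rotl(x: int, k: int) -> int:
    """S^k: circular left shift of a 32-bit value by k bits."""
    return ((x << k) | (x >> (32 - k))) & MASK

def sha_step(a, b, c, d, e, f_t, w_t, k_t):
    """One atomic step: returns the new (A, B, C, D, E)."""
    new_a = (rotl(a, 5) + f_t(b, c, d) + e + w_t + k_t) & MASK
    return new_a, a, rotl(b, 30), c, d

# Values from Example 8.4, with f_t assumed to be B xor C xor D.
state = sha_step(0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0,
                 f_t=lambda b, c, d: b ^ c ^ d, w_t=0x00000000, k_t=0x5A827999)
print([format(x, "08x") for x in state])
```

Note that only A and C receive newly computed values; B, D, and E are copies of the old A, C, and D.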
Overflow Handling
• There are two ways to handle overflow:
  • Open addressing
    • Linear probing
    • Quadratic probing
    • Rehashing
  • Chaining

Open Addressing
• Assumes the hash table is an array.
• The hash table is initialized so that each slot contains the null key.
• When a new key is hashed into a full bucket, find the closest unfilled bucket.
  • This is called linear probing or linear open addressing.

Linear Probing (Example 8.6)
• Assume a 26-bucket table with one slot per bucket and the following keys: GA, D, A, G, L, A2, A1, A3, A4, Z, ZA, E. Let the hash function be h(k) = first character of k.
• When entering G, G collides with GA and is entered at ht[7] instead of ht[6].

Bucket  Key
0       A
1       A2
2       A1
3       D
4       A3
5       A4
6       GA
7       G
8       ZA
9       E
10      (empty)
11      L
12-24   (empty)
25      Z

Linear Probing
• When linear open addressing is used to handle overflows, a hash table search for key k proceeds as follows:
  • Compute h(k).
  • Examine the keys at positions ht[h(k)], ht[(h(k)+1) % b], …, ht[(h(k)+j) % b], in this order, until one of the following conditions holds:
    • ht[(h(k)+j) % b] = k; in this case k is found.
    • ht[(h(k)+j) % b] is empty; k is not in the table.
    • We return to the starting position ht[h(k)]; the table is full and k is not in the table.

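The search procedure above, together with the matching insertion, can be sketched as follows. The table layout and hash function are those of Example 8.6; the function names `insert` and `search` and the choice to return the number of buckets examined are assumptions of this sketch.

```python
B = 26  # number of buckets, one slot each

def h(key: str) -> int:
    """Hash on the first character: A -> 0, ..., Z -> 25 (as in Example 8.6)."""
    return ord(key[0].upper()) - ord("A")

def insert(ht: list, key: str) -> int:
    """Insert key with linear probing; return the bucket index used."""
    home = h(key)
    for j in range(B):
        i = (home + j) % B
        if ht[i] is None or ht[i] == key:
            ht[i] = key
            return i
    raise OverflowError("hash table is full")

def search(ht: list, key: str):
    """Return (index, buckets examined); index is None if key is absent."""
    home = h(key)
    for j in range(B):
        i = (home + j) % B
        if ht[i] == key:
            return i, j + 1          # found
        if ht[i] is None:
            return None, j + 1       # empty bucket: key is not in the table
    return None, B                   # back at the start: table is full

ht = [None] * B
for key in ["GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "Z", "ZA", "E"]:
    insert(ht, key)
print(ht[:12], ht[25])
print(search(ht, "ZA"))
```

Inserting the twelve keys in the order given reproduces the table of Example 8.6, and the search for ZA wraps around from ht[25] to ht[8].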
Linear Probing
• When linear probing is used to resolve overflows, keys tend to cluster together.
  • This increases search time.
  • e.g., to find ZA in the table of Example 8.6, you need to examine ht[25], ht[0], …, ht[8] (a total of 10 comparisons).
• The expected number of comparisons to look up a key is approximately (2-α)/(2-2α), where α is the loading density.

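As a worked instance of the approximation, plug in the table of Example 8.6 (n = 12 keys, b = 26 buckets, s = 1 slot per bucket):

```latex
\alpha = \frac{12}{26} = \frac{6}{13}, \qquad
\frac{2-\alpha}{2-2\alpha} = \frac{2 - \tfrac{6}{13}}{2 - \tfrac{12}{13}}
  = \frac{20/13}{14/13} = \frac{10}{7} \approx 1.43
```

The formula describes an average over uniformly distributed keys, so a table whose keys cluster heavily on one letter, as in the example, can need noticeably more comparisons.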
Quadratic Probing
• One of the problems of linear open addressing is that it tends to create clusters of keys.
  • These clusters tend to merge as more keys are entered, leading to bigger clusters.
  • It becomes difficult to find an unused bucket.
• A quadratic probing scheme reduces the growth of clusters. A quadratic function of i is used as the increment when searching through buckets.
• Perform the search by examining buckets h(k), (h(k)+i²) % b, (h(k)-i²) % b for 1 ≤ i ≤ (b-1)/2.

Example
• Probe sequence: (h(k)+i²) % b, (h(k)-i²) % b for 1 ≤ i ≤ (b-1)/2.
• Example: b = 10, insert 89, 18, 49, 58, 69.

Index  0   1  2   3   4  5  6  7  8   9
Key    49  -  58  69  -  -  -  -  18  89

• 89 % 10 = 9, 18 % 10 = 8
• 49 % 10 = 9, (9+1²) % 10 = 0
• 58 % 10 = 8, (8+1²) % 10 = 9, (8+2²) % 10 = 2
• 69 % 10 = 9, (9+1²) % 10 = 0, (9+2²) % 10 = 3

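The worked example can be replayed in code. Note that the example only ever uses the +i² probes, so this sketch does the same and leaves out the -i² probes of the general rule. The function name `quadratic_insert` is hypothetical.

```python
def quadratic_insert(table: list, key: int) -> int:
    """Insert key using probes h(k), (h(k) + i*i) % b for i = 1, 2, ...

    Only the +i*i probes are tried, as in the worked example; the -i*i
    probes from the general rule are left out of this sketch.
    """
    b = len(table)
    home = key % b
    for i in range(b):
        slot = (home + i * i) % b
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no free bucket found along the probe sequence")

table = [None] * 10
for key in (89, 18, 49, 58, 69):
    quadratic_insert(table, key)
print(table)
```

Unlike linear probing, the keys that collide at buckets 8 and 9 end up scattered (at 0, 2, and 3) instead of forming one contiguous run.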
Rehashing
• Another way to control the growth of clusters is to use a series of hash functions h1, h2, …, hm. This is called rehashing.
• Buckets hi(k), 1 ≤ i ≤ m, are examined in that order.

Chaining
• Linear probing performs poorly because the search for a key involves comparisons with keys that have different hash values.
• Unnecessary comparisons can be avoided if all the synonyms are kept in the same list, with one list per bucket.
• Each chain has a head node. Head nodes are stored sequentially.

Example
• Average search length is (6*1 + 3*2 + 1*3 + 1*4 + 1*5)/12 = 2.

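The average above can be reproduced with a small chaining sketch. The figure for this example did not survive, so the key set and hash function are assumed to be those of Example 8.6 (twelve keys, hashed on the first character), which is consistent with the denominator of 12. The names `build_chains` and `comparisons` are hypothetical.

```python
from collections import defaultdict

def h(key: str) -> int:
    """Hash on the first character: A -> 0, ..., Z -> 25 (assumed, as in Example 8.6)."""
    return ord(key[0].upper()) - ord("A")

def build_chains(keys):
    """One list (chain) per bucket; new keys are appended to the end of the chain."""
    chains = defaultdict(list)
    for key in keys:
        chains[h(key)].append(key)
    return chains

def comparisons(chains, key) -> int:
    """Number of keys examined to find `key` in its chain."""
    return chains[h(key)].index(key) + 1

keys = ["GA", "D", "A", "G", "L", "A2", "A1", "A3", "A4", "Z", "ZA", "E"]
chains = build_chains(keys)
average = sum(comparisons(chains, k) for k in keys) / len(keys)
print(average)
```

Six keys sit first in their chain, three sit second, and the chain for A contributes the keys at positions 3, 4, and 5, which matches the terms in the formula.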
Dynamic Hashing
• Retains the fast retrieval time of hashing
• Allows the file size to increase and decrease dynamically without penalty
• Aims to minimize the number of page accesses

An Example Hash Function
k     h(k)
A0    100 000
A1    100 001
B0    101 000
B1    101 001
C1    110 001
C2    110 010
C3    110 011
C5    110 101
• A: 100, B: 101, C: 110

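The table can be generated by a small function: three bits for the letter followed by three bits for the digit. The sketch also defines h(k, p), read here as the p least significant bits of h(k), which matches the values used on the following slides. The names `h` and `h_p` are chosen for this sketch.

```python
def h(key: str) -> int:
    """6-bit hash from the table above: 3 bits for the letter, 3 for the digit.

    Letters map as A -> 100, B -> 101, C -> 110; digits 0-7 map to their
    3-bit binary value.
    """
    letter = ord(key[0]) - ord("A") + 0b100
    digit = int(key[1])
    return (letter << 3) | digit

def h_p(key: str, p: int) -> str:
    """h(k, p): the p least significant bits of h(k), as a bit string."""
    return format(h(key) & ((1 << p) - 1), f"0{p}b")

for key in ("A0", "A1", "B0", "B1", "C1", "C2", "C3", "C5"):
    print(key, format(h(key), "06b"))
```

Using the low-order bits means that growing p by one refines an existing address instead of reshuffling all keys.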
Hashing With Directory
• Accessing any page requires only two steps:
  • First step: use the hash function to find the address of the directory entry.
  • Second step: retrieve the page associated with that address.

Dynamic Hash Tables with Directories
• Insert C5 into the hash table:
  • h(C5,2) = 01; d[01] → bucket holding A1, B1 → overflow
  • Determine the least u such that h(k,u) is not the same for all keys in the overflowed bucket → this gives u = 3
  • h(A1,3) = 001, h(B1,3) = 001, h(C5,3) = 101
• Insert C1 into the hash table:
  • h(C1,2) = 01; d[01] → bucket holding A1, B1 → overflow
  • Determine the least u such that h(k,u) is not the same for all keys in the overflowed bucket → this gives u = 4
  • h(A1,4) = 0001, h(B1,4) = 1001, h(C1,4) = 0001

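The search for the least u can be written out directly. The sketch assumes the example hash function of this chapter (letter A/B/C as 100/101/110 followed by the digit's three bits) and reads h(k, u) as the u least significant bits of h(k). The name `least_u` and the `max_bits` bound are assumptions of this sketch.

```python
def h(key: str) -> int:
    """6-bit example hash: letter A/B/C -> 100/101/110, followed by the digit's 3 bits."""
    return ((ord(key[0]) - ord("A") + 0b100) << 3) | int(key[1])

def h_p(key: str, p: int) -> int:
    """h(k, p): the p least significant bits of h(k)."""
    return h(key) & ((1 << p) - 1)

def least_u(keys, max_bits: int = 6) -> int:
    """Smallest u for which h(k, u) is not the same for all the given keys."""
    for u in range(1, max_bits + 1):
        if len({h_p(k, u) for k in keys}) > 1:
            return u
    raise ValueError("keys are indistinguishable under h")

print(least_u(["A1", "B1", "C5"]))   # overflow caused by inserting C5
print(least_u(["A1", "B1", "C1"]))   # overflow caused by inserting C1
```

The value of u decides how deep the directory must become: a directory indexed by u bits has 2^u entries, so inserting C1 forces a larger directory than inserting C5.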
Directoryless Dynamic Hashing
• Hashing with a directory requires at least one level of indirection.
• Directoryless hashing assumes a contiguous address space in memory that is large enough to hold all the records. Therefore, the directory is not needed.
• Thus, the hash function must produce the actual address of the page containing the key.
• In contrast to the directory scheme, in which a single page might be pointed at by several directory entries, in the directoryless scheme one unique page is assigned to every possible address.

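Inserting into a Directoryless Dynamic Hash Table
• Active pages are 0 through 2^r + q - 1, where 0 ≤ q < 2^r.
• Pages 0 through q-1 and 2^r through 2^r + q - 1 are indexed using h(k, r+1); the other pages are indexed using h(k, r).

(a) r = 2, q = 0
Page  Slots
00    B4, A0
01    A1, B5
10    C2, -
11    C3, -

(b) Insert C5, r = 2, q = 1
Page  Slots   Overflow bucket
000   A0, -
001   A1, B5  C5
010   C2, -
011   C3, -
100   B4, -   (new active bucket)

(c) Insert C1, r = 2, q = 2
Page  Slots
000   A0, -
001   A1, C1
010   C2, -
011   C3, -
100   B4, -
101   B5, C5  (new active bucket)

• Insert C5 (110 101) into the hash table:
  • h(C5,2) = 01; page 01 holds A1, B5 → overflow
  • Page 2^r + q is activated and the keys of page q are reallocated → (b)
  • Pages 000 and 100 are now indexed using h(k,3).
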
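The addressing rule for the directoryless scheme can be sketched as a single function. It assumes the example hash function of this chapter and reads h(k, r) as the r least significant bits of h(k). The name `page_address` is hypothetical, and overflow buckets are not modeled.

```python
def h(key: str) -> int:
    """6-bit example hash: letter A/B/C -> 100/101/110, followed by the digit's 3 bits."""
    return ((ord(key[0]) - ord("A") + 0b100) << 3) | int(key[1])

def h_p(key: str, p: int) -> int:
    """h(k, p): the p least significant bits of h(k)."""
    return h(key) & ((1 << p) - 1)

def page_address(key: str, r: int, q: int) -> int:
    """Page for `key` when pages 0 .. 2**r + q - 1 are active (0 <= q < 2**r).

    Pages 0 .. q-1 have already been split, so keys that hash there under
    h(k, r) are re-addressed with h(k, r + 1).
    """
    page = h_p(key, r)
    if page < q:
        page = h_p(key, r + 1)
    return page

# Case (c): r = 2, q = 2, after inserting C5 and C1.
for key in ("A0", "A1", "C1", "C2", "C3", "B4", "B5", "C5"):
    print(key, format(page_address(key, 2, 2), "03b"))
```

Because the address is computed rather than looked up, no directory access is needed; the cost is that pages are split in a fixed order (page q next), not necessarily the page that overflowed.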