Transcript Hash Functions
Introduction to Algorithms
Jiafen Liu Sept. 2013
Today’s Tasks
• Hashing
• Direct-access tables
• Choosing good hash functions
– Division method
– Multiplication method
• Resolving collisions by chaining
• Resolving collisions by open addressing
Symbol-Table Problem
• Hashing comes up in compilers as the symbol-table problem.
• Suppose: a table S holding n records, each with a key.
• Operations on S:
– INSERT(S, x)
– DELETE(S, x)
– SEARCH(S, k)
• Dynamic set (supports all three operations) vs. static set.
The Simplest Case
• Suppose that the keys are drawn from the universe U ⊆ {0, 1, …, m–1}, and that the keys are distinct.
• Direct-access table: set up an array T[0 . . m–1]:
– T[k] = x if x ∈ S and key[x] = k,
– T[k] = NIL otherwise.
• In the worst case, each of the 3 operations takes
– Θ(1) time.
• Limitations of the direct-access table?
– The range of keys can be large: 64-bit numbers, or character strings (difficult to represent as array indices).
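The direct-access scheme above is short enough to sketch in code; the class and record names here are illustrative, not from the lecture:

```python
class DirectAccessTable:
    """Direct-access table: keys drawn from {0, 1, ..., m-1}, all distinct."""

    def __init__(self, m):
        self.T = [None] * m          # T[k] = record with key k, or None (NIL)

    def insert(self, key, record):   # Theta(1)
        self.T[key] = record

    def delete(self, key):           # Theta(1)
        self.T[key] = None

    def search(self, key):           # Theta(1)
        return self.T[key]

t = DirectAccessTable(10)
t.insert(3, "record-3")
print(t.search(3))   # record-3
print(t.search(4))   # None
```

All three operations are single array accesses, which is exactly where the Θ(1) worst-case bound comes from.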
• Hashing : Try to keep the table small, while preserving the property of linear running time.
Naïve Hashing
• Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1} .
[Figure: a hash function h maps keys k1, …, k5 from the universe of keys into table T[0 . . m–1]; h(k2) = h(k5), so two keys share a slot.]
Collisions
• When a record to be inserted maps to an already occupied slot in T, a collision occurs.
• The simplest way to resolve a collision?
– Link records in the same slot into a list.
[Figure: keys 49, 86, and 52 all hash to the same slot i, h(49) = h(86) = h(52) = i, and are linked together in a chain.]
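A minimal chained hash table can be sketched as follows; the division-method hash and the demo keys (10, 17, 24, which all equal 3 mod 7) are illustrative choices, not from the slide:

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining records in each slot."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one list (chain) per slot

    def _h(self, k):
        return k % self.m                     # division method, for illustration

    def insert(self, k, v):
        chain = self.slots[self._h(k)]
        for i, (key, _) in enumerate(chain):
            if key == k:
                chain[i] = (k, v)             # key already present: update it
                return
        chain.append((k, v))

    def search(self, k):
        for key, v in self.slots[self._h(k)]:
            if key == k:
                return v
        return None

    def delete(self, k):
        j = self._h(k)
        self.slots[j] = [(key, v) for key, v in self.slots[j] if key != k]

t = ChainedHashTable(7)
for k in (10, 17, 24):        # all three hash to slot 3 and share one chain
    t.insert(k, str(k))
print(t.search(17))           # 17
print(len(t.slots[3]))        # 3
```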
Worst Case of Chaining
• What’s the worst case of chaining?
– Every key hashes to the same slot, and the table degenerates into a single linked list.
• Access Time in the worst case?
– Θ(n) if we assume the size of S is n.
Average Case of Chaining
• In order to analyze the average case:
– We should know all possible inputs and their probabilities.
– We don't know the exact distribution, so we make assumptions.
• Here, we make the assumption of simple uniform hashing:
– Each key k in S is equally likely to be hashed to any slot in T, independently of other keys.
• Simple uniform hashing includes an independence assumption.
Average Case of Chaining
• Let n be the number of keys in the table, and let m be the number of slots.
• Under the simple uniform hashing assumption, what is the probability that two keys hash to the same slot?
– 1/m.
• Define the load factor of T to be α = n/m. What does that mean?
– The average number of keys per slot.
Search Cost
• The expected time for an unsuccessful search for a record with a given key is?
– Θ(1 + α): Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the list.
• If α = O(1), expected search time = Θ(1).
• How about a successful search?
– It has the same asymptotic bound.
– Reserved for your homework.
Choosing a hash function
• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.
– A good hash function should distribute the keys uniformly into all the slots.
– Regularity of the key distribution should not affect this uniformity.
• For example, all the keys are even numbers.
• The simplest way to distribute keys to m slots evenly?
Division Method
• Assume all keys are integers, and define h(k) = k mod m .
• Advantage: simple, and usually practical.
• Caution:
– Be careful about the choice of modulus m.
– It doesn't work well for every table size m.
• Example: suppose we pick m with a small divisor d.
Deficiency of Division Method
• Deficiency: if we pick m with a small divisor d.
– Example: d = 2, so that m is an even number.
– Suppose it happens that all the keys are even.
– What happens to the hash table?
– We will never hash anything to an odd-numbered slot: half the table is wasted.
Deficiency of Division Method
• Extreme deficiency: if m = 2^r, then all of m's factors are small divisors.
• If k = (1011000111011010)₂ and m = 2^6, what does the hash value turn out to be?
– h(k) = k mod 2^6 = (011010)₂, just the low-order 6 bits of k.
• The hash value doesn't depend on all the bits of k.
• Suppose all the low-order bits are the same and only the high-order bits differ: then every key hashes to the same slot.
How to choose modulus?
• Heuristics for choosing modulus m:
– Choose m to be a prime.
– Make m not close to a power of two or ten.
• The division method is not a really good one:
– Sometimes, making the table size a prime is inconvenient; we often want a table of size 2^r.
– Also, division takes more time to compute than multiplication or addition on most computers.
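The even-keys deficiency and the prime-modulus heuristic can be checked directly; the key list below is an arbitrary illustration:

```python
def h_div(k, m):
    """Division-method hash: h(k) = k mod m."""
    return k % m

even_keys = [2, 4, 8, 16, 34, 62, 100]   # adversarially, all keys are even

# Even table size m = 8: even keys can only land in even-numbered slots.
slots_even_m = {h_div(k, 8) for k in even_keys}
print(sorted(slots_even_m))              # only even slot numbers appear

# Prime table size m = 7: the same keys also reach odd-numbered slots.
slots_prime_m = {h_div(k, 7) for k in even_keys}
print(sorted(slots_prime_m))
```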
Another method —Multiplication
• Multiplication method is a little more complicated but superior.
• Assume that all keys are integers, m = 2^r, and our computer has w-bit words.
• Define h(k) = (A·k mod 2^w) rsh (w–r):
– A is an odd integer in the range 2^(w–1) < A < 2^w
– (both the highest bit and the lowest bit are 1).
– rsh is the “bitwise right-shift” operator .
• Multiplication modulo 2 w is fast compared to division, and the rsh operator is fast.
• Tips: don't pick A too close to 2^(w–1) or 2^w.
Example of multiplication method
• Suppose that m = 8 = 2^3, r = 3, and that our computer has w = 7-bit words.
• We choose A = 1011001₂ = 89 and k = 1101011₂ = 107.
• A·k = 9523 = 10010100110011₂: the high 7 bits are ignored by mod 2^w, the low w–r = 4 bits are ignored by rsh, and h(k) is the remaining middle r = 3 bits, 011₂ = 3.
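The worked example can be replayed in code; this is a direct sketch of the definition h(k) = (A·k mod 2^w) rsh (w–r):

```python
def h_mul(k, A, w, r):
    """Multiplication-method hash: keep the middle r bits of A*k."""
    return ((A * k) % (1 << w)) >> (w - r)   # mod 2^w, then rsh (w - r)

w, r = 7, 3              # 7-bit words, table size m = 2^3 = 8
A = 0b1011001            # 89: odd, and 2^(w-1) < A < 2^w
k = 0b1101011            # 107

# A*k = 9523 = 0b10010100110011; mod 2^7 keeps the low 7 bits (0110011),
# and rsh 4 drops the low 4 of those, leaving the middle bits 011 = 3.
print(h_mul(k, A, w, r))  # 3
```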
Another way to solve collision
• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.
• There's another way, open addressing, whose idea is: no storage for links.
• We should systematically probe the table until an empty slot is found.
Open Addressing
• The hash function depends on both the key and the probe number:
h : U × {0, 1, …, m–1} → {0, 1, …, m–1}
(universe of keys × probe number → slot number)
• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.
Implementation of Insertion
• What about HASH-SEARCH(T,k)?
Implementation of Searching
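The slides' HASH-INSERT and HASH-SEARCH pseudocode is not reproduced in this transcript; a minimal sketch, using linear probing as the probe sequence, might look like this:

```python
EMPTY = None

class OpenAddressTable:
    """Open-addressing hash table: records live in the slots themselves."""

    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, k, i):
        return (k % self.m + i) % self.m   # linear probing: h(k, i)

    def insert(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is EMPTY:     # probe until an empty slot is found
                self.slots[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is EMPTY:
                return None                # an empty slot ends the search
            if self.slots[j] == k:
                return j
        return None

t = OpenAddressTable(8)
for k in (3, 11, 19):    # all hash to slot 3; probing resolves the collisions
    t.insert(k)
print(t.search(11))      # 4
print(t.search(5))       # None
```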
More about Open Addressing
• The hash table may fill up.
– We must have the number of elements less than or equal to the table size.
• Deletion is difficult, why?
– Suppose we remove a key from the table, and someone later probes for a different element.
– The probe sequence they use happens to pass through the slot of the key we deleted.
– They find an empty slot and conclude, wrongly, that the key they are looking for isn't in the table.
• Therefore we should mark deleted slots with a special DELETED value rather than emptying them.
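The marking idea can be sketched with a DELETED sentinel; the sentinel value and the linear-probing choice here are illustrative:

```python
EMPTY, DELETED = None, "<deleted>"

class TombstoneTable:
    """Open addressing with tombstones: deletion marks slots, never empties them."""

    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, k, i):
        return (k % self.m + i) % self.m   # linear probing

    def insert(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is EMPTY or self.slots[j] == DELETED:
                self.slots[j] = k          # tombstone slots may be reused
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for i in range(self.m):
            j = self._probe(k, i)
            if self.slots[j] is EMPTY:
                return None                # a tombstone does NOT stop the search
            if self.slots[j] == k:
                return j
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.slots[j] = DELETED        # mark, don't empty

t = TombstoneTable(8)
t.insert(3); t.insert(11)    # 11 probes past slot 3 into slot 4
t.delete(3)                  # slot 3 becomes a tombstone
print(t.search(11))          # 4: still found, despite the hole at slot 3
```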
Example of open addressing
Some heuristics about probe
• We can record, globally, the largest number of probes any insertion has needed.
– A search never needs to look beyond that number of probes.
• There are lots of ideas about forming a probe sequence effectively. • The simplest one is ?
– linear probing.
The simplest probing strategy
• Linear probing: given an ordinary hash function h(k), linear probing uses
h(k,i) = (h(k) + i) mod m
• Advantage: simple.
• Disadvantage?
– primary clustering
Primary Clustering
• It suffers from primary clustering: long runs of occupied slots build up in regions of the table.
– Anything that hashes into such a run has to probe through all of it.
– What's more, new keys that land in the run extend it, further increasing the average search time.
Another probing strategy
• Double hashing: given two ordinary hash functions h₁(k) and h₂(k), double hashing uses
h(k,i) = (h₁(k) + i·h₂(k)) mod m
• If h₂(k) is relatively prime to m, double hashing generally produces excellent results.
– A common choice: make m a power of 2 and design h₂(k) to produce only odd numbers.
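A sketch of the double-hashing probe sequence, with m a power of 2 and h₂ forced odd as suggested above (the particular h₁ and h₂ are illustrative):

```python
def probe_sequence(k, m):
    """Double-hashing probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m.

    m is a power of 2 and h2(k) is forced odd, so h2(k) is relatively
    prime to m and the sequence visits every slot exactly once.
    """
    h1 = k % m
    h2 = (k // m) | 1          # any odd value works; this is one easy choice
    return [(h1 + i * h2) % m for i in range(m)]

seq = probe_sequence(123, 8)
print(seq)                     # [3, 2, 1, 0, 7, 6, 5, 4]
print(sorted(seq))             # [0, 1, 2, 3, 4, 5, 6, 7]: a full permutation
```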
Analysis of open addressing
• We make the assumption of uniform hashing:
– Each key is equally likely to have any one of the m! permutations as its probe sequence, independent of other keys.
• Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1–α).
Proof of the theorem
Proof:
• At least one probe is always necessary.
• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.
• With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.
• And then how to finish the proof?
• Observe that (n–i)/(m–i) < n/m = α for i = 1, 2, …, n.
Proof of the theorem
• Therefore, the expected number of probes is
1 + (n/m)·(1 + ((n–1)/(m–1))·(1 + ((n–2)/(m–2))·(⋯)))
≤ 1 + α·(1 + α·(1 + α·(⋯)))
≤ 1 + α + α² + α³ + ⋯
= 1/(1–α). (geometric series)
Implications of the theorem
• If α is constant, then accessing an open addressed hash table takes constant time.
• If the table is half full, then the expected number of probes is ?
– 1/(1–0.5) = 2 .
• If the table is 90% full, then the expected number of probes is?
– 1/(1–0.9) = 10 .
• As the table approaches full utilization, hashing slows down dramatically.
Still Hashing
• Universal hashing • Perfect hashing
A weakness of hashing
• Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot. – It causes the average access time of a hash table to skyrocket.
– An adversary can pick all keys from {k: h(k) = i } for some slot i.
• IDEA: Choose the hash function at random , independently of the keys.
Universal hashing
• Definition: a set H of hash functions mapping U to {0, 1, …, m–1} is universal if, for every pair of distinct keys x, y ∈ U, the number of functions h ∈ H with h(x) = h(y) is exactly |H|/m.
– Equivalently, for a random h ∈ H, the probability that x and y collide is 1/m.
Universality is good
• Theorem : • Let h be a hash function chosen at random from a universal set H of hash functions. • Suppose h is used to hash n arbitrary keys into the m slots of a table T.
• Then for a given key x, we have: E[number of collisions with x] < n/m .
Universality theorem
• Proof. Let Cₓ be the random variable denoting the total number of collisions of keys in T with x, and let
c_xy = 1 if h(x) = h(y), and c_xy = 0 otherwise.
• By universality, E[c_xy] = Pr{h(x) = h(y)} = 1/m.
• Since Cₓ = Σ_{y ∈ T–{x}} c_xy, linearity of expectation gives
E[Cₓ] = Σ_{y ∈ T–{x}} E[c_xy] = (n–1)/m < n/m.
Constructing a universal set of hash functions
• One method to construct a universal set of hash functions:
• Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}.
– That is, let k = 〈k₀, k₁, …, k_r〉, where 0 ≤ kᵢ < m.
• Pick a = 〈a₀, a₁, …, a_r〉, where each aᵢ is chosen randomly from {0, 1, …, m–1}, and define
h_a(k) = (Σᵢ aᵢ·kᵢ) mod m.
• How big is H = {h_a}?
– |H| = m^(r+1).
• Theorem. The set H = {h_a} is universal.
• Proof. Suppose that x = 〈x₀, x₁, …, x_r〉 and y = 〈y₀, y₁, …, y_r〉 are distinct keys.
• Thus, they differ in at least one digit position; without loss of generality, position 0.
• For how many h_a ∈ H do x and y collide?
• h_a(x) = h_a(y) implies that Σᵢ aᵢ·xᵢ ≡ Σᵢ aᵢ·yᵢ (mod m).
• Equivalently, Σᵢ aᵢ·(xᵢ – yᵢ) ≡ 0 (mod m), that is,
a₀·(x₀ – y₀) ≡ –Σ_{i=1..r} aᵢ·(xᵢ – yᵢ) (mod m).
• Fact from number theory: since m is prime and x₀ ≠ y₀, the inverse (x₀ – y₀)⁻¹ mod m exists, which implies that
a₀ ≡ (–Σ_{i=1..r} aᵢ·(xᵢ – yᵢ))·(x₀ – y₀)⁻¹ (mod m).
• Thus, for any choices of a₁, a₂, …, a_r, exactly one choice of a₀ causes x and y to collide.
• How many h_a will cause x and y to collide?
– There are m choices for each of a₁, a₂, …, a_r, but once these are chosen, exactly one choice of a₀ causes x and y to collide.
– Thus, the number of h_a that cause x and y to collide is m^r·1 = m^r = |H|/m, so H is universal.
Perfect hashing
• Requirement: given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.
• IDEA: a two-level scheme with universal hashing at both levels, and no collisions at level 2!
• Theorem. Let H be a class of universal hash functions for a table of size m = n². If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2.
• Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n². There are n(n–1)/2 pairs of keys that can possibly collide, so the expected number of collisions is
(n(n–1)/2)·(1/n²) < 1/2.
• Markov's inequality says that for any nonnegative random variable X, we have Pr{X ≥ t} ≤ E[X]/t.
• Theorem. The probability of no collisions is at least 1/2.
• Proof. Applying Markov's inequality with t = 1, the probability of 1 or more collisions is at most E[X]/1 ≤ 1/2.
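The digit-decomposition family can be sketched directly; the prime m = 97, the r = 3, and the fixed seed below are arbitrary illustrative choices:

```python
import random

def make_universal_hash(m, r, seed=0):
    """Draw a random h_a from the universal family:
    h_a(k) = (sum of a_i * k_i) mod m, where k_0..k_r are k's base-m digits."""
    rng = random.Random(seed)
    a = [rng.randrange(m) for _ in range(r + 1)]   # random digit coefficients

    def h_a(k):
        total = 0
        for i in range(r + 1):
            total += a[i] * (k % m)   # k % m is the current base-m digit k_i
            k //= m                   # move on to the next digit
        return total % m

    return h_a

m, r = 97, 3                  # prime m; handles keys up to 97**4 - 1
h = make_universal_hash(m, r)
print(all(0 <= h(k) < m for k in range(1000)))   # True: always a valid slot
```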
• Conclusion: just by testing random hash functions in H, we'll quickly find one that works.