Transcript: Hash Functions
Introduction to Algorithms
Jiafen Liu Sept. 2013
Today’s Tasks
Hashing
• Direct-access tables
• Choosing good hash functions
– Division method
– Multiplication method
• Resolving collisions by chaining
• Resolving collisions by open addressing
Symbol-Table Problem
• Hashing comes up in compilers as the symbol-table problem.
• Suppose: a table S holding n records.
• Operations on S:
– INSERT(S, x)
– DELETE(S, x)
– SEARCH(S, k)
• Dynamic set vs. static set
The Simplest Case
• Suppose that the keys are drawn from the universe U ⊆ {0, 1, …, m–1}, and that the keys are distinct.
• Direct-access table: set up an array T[0..m–1] with T[k] = x if x ∈ S and key[x] = k, and T[k] = NIL otherwise.
• In the worst case, each of the 3 operations takes Θ(1) time.
• Limitations of a direct-access table?
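As a concrete illustration, here is a minimal Python sketch of a direct-access table (the class name and sizes are illustrative, not from the lecture):

```python
# A direct-access table sketch: keys are assumed to be distinct
# integers drawn from the universe {0, 1, ..., m-1}.

class DirectAccessTable:
    def __init__(self, m):
        self.slots = [None] * m      # T[0..m-1], all initially NIL

    def insert(self, key, record):   # Θ(1)
        self.slots[key] = record

    def delete(self, key):           # Θ(1)
        self.slots[key] = None

    def search(self, key):           # Θ(1); None means "not present"
        return self.slots[key]

t = DirectAccessTable(16)
t.insert(7, "record-7")
print(t.search(7))   # -> record-7
print(t.search(3))   # -> None
```

All three operations are single array accesses, which is where the Θ(1) worst-case bound comes from.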
– The range of keys can be large: 64-bit numbers, or character strings (difficult to use directly as array indices).
• Hashing: try to keep the table small while preserving the benefit of constant expected running time.
Naïve Hashing
• Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1} .
(Figure: a hash function h maps keys k1, …, k5 into the table T[0..m–1]; here h(k2) = h(k5), so two keys land in the same slot.)
Collisions
• When a record to be inserted maps to an already occupied slot in T, a collision occurs.
• The simplest way to resolve a collision?
– Link records in the same slot into a list.
(Figure: records 49, 86, and 52 all hash to the same slot i, i.e. h(49) = h(86) = h(52) = i, and are linked together in one chain.)
Worst Case of Chaining
• What’s the worst case of chaining?
– Each key hashes to the same slot, and the table degenerates into a single linked list.
• Access Time in the worst case?
– Θ(n) if we assume the size of S is n.
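A sketch of collision resolution by chaining (illustrative code, not from the lecture): each slot of T holds a chain of records, here modeled by a Python list, and the keys below are chosen so that they all collide, demonstrating the worst case.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # division-method hash

    def insert(self, key, value):             # Θ(1): prepend to the chain
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):                    # Θ(1 + α) expected, Θ(n) worst
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

t = ChainedHashTable(7)
for k in (3, 10, 17):     # all ≡ 3 (mod 7): worst case, one long chain
    t.insert(k, str(k))
print(t.search(17))       # -> 17  (found after walking the chain)
```

With all n keys in one chain, a search walks a list of length n, which is the Θ(n) worst case mentioned above.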
Average Case of Chaining
• In order to analyze the average case:
– We would need to know all possible inputs and their probabilities.
– We don't know the exact distribution, so we make assumptions.
• Here, we make the assumption of simple uniform hashing:
– Each key k in S is equally likely to be hashed to any slot in T, independent of where other keys hash.
• Simple uniform hashing includes an independence assumption.
Average Case of Chaining
• Let n be the number of keys in the table, and let m be the number of slots.
• Under the simple uniform hashing assumption, what is the probability that two keys hash to the same slot?
– 1/m.
• Define the load factor of T to be α = n/m. What does that mean?
– The average number of keys per slot.
Search Cost
• The expected time for an unsuccessful search for a record with a given key is?
– Θ(1 + α): Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the list.
• If α = O(1), the expected search time is Θ(1).
• How about a successful search?
– It has the same asymptotic bound.
– The proof is left for your homework.
Choosing a hash function
• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.
– A good hash function should distribute the keys uniformly into all the slots.
– Regularity of the key distribution should not affect this uniformity.
• For example, all the keys are even numbers.
• What is the simplest way to distribute keys evenly among m slots?
Division Method
• Assume all keys are integers, and define h(k) = k mod m .
• Advantage: simple and usually practical.
• Caution:
– Be careful about the choice of modulus m.
– It does not work well for every table size m.
• Example: what if we pick m with a small divisor d?
Deficiency of Division Method
• Deficiency: we pick m with a small divisor d.
– Example: d = 2, so that m is an even number.
– Suppose that all the keys happen to be even.
– What happens to the hash table?
– We will never hash anything to an odd-numbered slot: half the table is wasted.
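The deficiency is easy to see in a few lines of Python (the keys and the table size are chosen for illustration):

```python
# Division-method deficiency: with an even modulus m and all-even keys,
# only even-numbered slots are ever used.
m = 8                                   # even table size (a bad choice here)
keys = [2, 4, 18, 36, 100, 256]         # it happens that all keys are even
slots = {k % m for k in keys}
print(sorted(slots))                    # -> [0, 2, 4]: only even slots
assert all(s % 2 == 0 for s in slots)   # odd slots are never hit
```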
Deficiency of Division Method
• Extreme deficiency: if m = 2^r, then all of m's divisors are small.
• If k = (1011000111011010)_2 and m = 2^6, what does the hash value turn out to be?
– h(k) = (011010)_2: just the low-order 6 bits of k.
• The hash value doesn't depend evenly on all the bits of k.
• Suppose all the low-order bits are the same and all the high-order bits differ: every key hashes to the same slot.
How to choose modulus?
• Heuristics for choosing the modulus m:
– Choose m to be a prime.
– Make m not too close to a power of two or ten.
• The division method is not really a good one:
– Sometimes making the table size a prime is inconvenient; we often want a table of size 2^r.
– Division also takes more time to compute than multiplication or addition on most computers.
Another method: multiplication
• Multiplication method is a little more complicated but superior.
• Assume that all keys are integers, m = 2^r, and our computer has w-bit words.
• Define h(k) = (A·k mod 2^w) rsh (w – r):
– A is an odd integer in the range 2^(w–1) < A < 2^w
– (both the highest bit and the lowest bit are 1).
– rsh is the “bitwise right-shift” operator .
• Multiplication modulo 2 w is fast compared to division, and the rsh operator is fast.
• Tip: don't pick A too close to 2^(w–1) or 2^w.
Example of multiplication method
• Suppose that m = 8 = 2^3 (so r = 3), and that our computer has w = 7-bit words.
• We chose A = (1011001)_2 = 89 and k = (1101011)_2 = 107.
• A·k = (10010100110011)_2. The high bits are ignored by mod 2^7, the low w – r = 4 bits are ignored by rsh, and the middle r = 3 bits give h(k) = (011)_2 = 3.
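The example can be checked in Python; this sketch uses a bit-mask for mod 2^w and a right shift for rsh, mirroring the slide's operators:

```python
# Multiplication-method sketch reproducing the example above:
# w = 7-bit words, m = 8 = 2^3 (so r = 3), A = 1011001_2 = 89 (odd).
def mult_hash(k, A, w, r):
    # (A * k mod 2^w) right-shifted by (w - r): keep the top r bits
    # of the low-order w bits of the product.
    return ((A * k) & ((1 << w) - 1)) >> (w - r)

A, w, r = 0b1011001, 7, 3
k = 0b1101011                  # = 107
print(mult_hash(k, A, w, r))   # -> 3, matching the slide
```

Both operations compile to a multiply, a mask, and a shift, which is why this method is fast on real hardware.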
Another way to resolve collisions
• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.
• There's another way, open addressing, with the idea: no storage for links.
• We systematically probe the table until an empty slot is found.
Open Addressing
• The hash function depends on both the key and the probe number: h : U × {0, 1, …, m–1} → {0, 1, …, m–1} maps a (key, probe number) pair to a slot number.
• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.
Implementation of Insertion
• What about HASH-SEARCH(T,k)?
Implementation of Searching
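The slide's pseudocode is not preserved in this transcript. The following Python sketch shows one possible shape of HASH-INSERT and HASH-SEARCH, using linear probing as a placeholder probe sequence (the class and method names are my own):

```python
# Open addressing: probe slots h(k,0), h(k,1), ... until an empty
# slot (insert) or the key / an empty slot (search) is found.
class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key, i):
        return (key + i) % self.m       # linear probing with h(k) = k mod m

    def insert(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is None:
                self.slots[j] = key
                return j                # slot where the key was stored
        raise OverflowError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is None:
                return None             # empty slot: key is not in the table
            if self.slots[j] == key:
                return j
        return None

t = OpenAddressTable(8)
for k in (10, 18, 26):     # all hash to slot 2; probing resolves the pile-up
    t.insert(k)
print(t.search(18))        # -> 3 (slot 2 was taken, probe moved on)
```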
More about Open Addressing
• The hash table may fill up.
– We must keep the number of elements n no greater than the table size m.
• Deletion is difficult, why?
– Suppose we remove a key from the table, and later somebody searches for another element.
– The probe sequence for that search happens to pass through the slot of the key we deleted.
– Finding an empty slot, the search concludes the key it is looking for is not in the table, even though it may lie further along the probe sequence.
• So we should keep deleted slots marked instead of emptying them.
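Here is a sketch of deletion with a special DELETED marker (a "tombstone"); the sentinel name and the linear probe are my own choices, not from the lecture:

```python
# Searches probe past tombstones; only a truly empty slot stops them.
DELETED = object()                       # sentinel distinct from any key

def delete(slots, m, key):
    for i in range(m):
        j = (key + i) % m                # same probe sequence as insert
        if slots[j] is None:
            return                       # key was never in the table
        if slots[j] == key:
            slots[j] = DELETED           # mark, don't empty, the slot
            return

def search(slots, m, key):
    for i in range(m):
        j = (key + i) % m
        if slots[j] is None:             # truly empty: stop
            return None
        if slots[j] == key:              # tombstones fail this test, so
            return j                     # the probe continues past them
    return None

slots = [None] * 8
for k in (10, 18, 26):                   # all probe from slot 2
    for i in range(8):
        j = (k + i) % 8
        if slots[j] is None or slots[j] is DELETED:
            slots[j] = k
            break

delete(slots, 8, 18)                     # tombstone in slot 3
print(search(slots, 8, 26))              # -> 4: still found past the tombstone
```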
Example of open addressing
Some heuristics about probe
• We can record, globally, the largest number of probes any insertion has needed.
– A search never needs to look further than that.
• There are lots of ideas about forming a probe sequence effectively. • The simplest one is ?
– linear probing.
The simplest probing strategy
• Linear probing: given an ordinary hash function h(k), linear probing uses h(k, i) = (h(k) + i) mod m.
• Advantage: simple.
• Disadvantage?
– primary clustering
Primary Clustering
• It suffers from primary clustering: long runs of occupied slots build up in regions of the table.
– Anything that hashes into such a region has to probe through all of it.
– What's more, long runs tend to grow even longer, increasing the average search time.
Another probing strategy
• Double hashing: given two ordinary hash functions h1(k) and h2(k), double hashing uses h(k, i) = (h1(k) + i·h2(k)) mod m.
• If h2(k) is relatively prime to m, double hashing generally produces excellent results.
– A common trick: make m a power of 2 and design h2(k) to produce only odd numbers.
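The "power of 2 plus odd h2" trick can be sketched as follows; the particular h1 and h2 below are illustrative choices, the point is only that an odd step is coprime to m = 2^r, so the probe sequence visits every slot:

```python
# Double hashing with m a power of 2 and h2 forced odd: the probe
# sequence (h1 + i*h2) mod m is then a permutation of {0, ..., m-1}.
def probe_sequence(key, m):
    h1 = key % m
    h2 = (key % (m - 1)) | 1            # forced odd, hence coprime to m = 2^r
    return [(h1 + i * h2) % m for i in range(m)]

seq = probe_sequence(12345, 8)
print(sorted(seq))                      # -> [0, 1, 2, 3, 4, 5, 6, 7]
```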
Analysis of open addressing
• We make the assumption of uniform hashing:
– Each key is equally likely to have any one of the m! permutations of {0, 1, …, m–1} as its probe sequence, independent of other keys.
• Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 – α).
Proof of the theorem
Proof:
• At least one probe is always necessary.
• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.
• With probability (n–2)/(m–2), the third probe hits an occupied slot, and so on.
• And then how do we finish the proof?
• Observe that (n–i)/(m–i) < n/m = α for i = 1, 2, …, n.
• Therefore, the expected number of probes is at most
1 + α(1 + α(1 + α(⋯))) ≤ 1 + α + α^2 + α^3 + ⋯ = 1/(1 – α) (geometric series).
Implications of the theorem
• If α is constant, then accessing an open addressed hash table takes constant time.
• If the table is half full, then the expected number of probes is ?
– 1/(1–0.5) = 2 .
• If the table is 90% full, then the expected number of probes is?
– 1/(1–0.9) = 10 .
• As the table approaches full utilization, hashing slows down dramatically.
Still Hashing
• Universal hashing • Perfect hashing
A weakness of hashing
• Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot. – It causes the average access time of a hash table to skyrocket.
– An adversary can pick all keys from {k: h(k) = i } for some slot i.
• IDEA: Choose the hash function at random , independently of the keys.
Universal hashing
• Definition: a set H of hash functions mapping U to {0, 1, …, m–1} is universal if for each pair of distinct keys x, y ∈ U, the number of functions h ∈ H with h(x) = h(y) is |H|/m.
• Equivalently: with h chosen at random from H, Pr{h(x) = h(y)} = 1/m.
Universality is good
• Theorem : • Let h be a hash function chosen at random from a universal set H of hash functions. • Suppose h is used to hash n arbitrary keys into the m slots of a table T.
• Then for a given key x, we have: E[number of collisions with x] < n/m .
Universality theorem
• Proof. Let C_x be the random variable denoting the total number of collisions of keys in T with x, and let c_xy = 1 if h(x) = h(y), and 0 otherwise.
• Note: E[c_xy] = 1/m (by universality), and C_x = Σ_{y ∈ T–{x}} c_xy.
• Then E[C_x] = Σ_{y ∈ T–{x}} E[c_xy] = (n – 1)/m < n/m.
Constructing a universal hash function set
• One method to construct a universal set of hash functions:
• Let m be prime. Decompose key k into r+1 digits, k = 〈k_0, k_1, …, k_r〉, where each digit k_i has a value in the set {0, 1, …, m–1}.
• Pick a = 〈a_0, a_1, …, a_r〉, where each a_i is chosen randomly from {0, 1, …, m–1}.
• Define h_a(k) = (Σ_{i=0}^{r} a_i·k_i) mod m.
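This "dot-product" construction can be sketched in a few lines of Python (parameter values are illustrative):

```python
import random

# Dot-product universal family: m prime, keys split into r+1 base-m
# digits, h_a(k) = (sum of a_i * k_i) mod m for random digits a_i.
def make_hash(m, r):
    a = [random.randrange(m) for _ in range(r + 1)]   # random a = <a_0..a_r>
    def h(key):
        digits = []
        for _ in range(r + 1):        # decompose key into base-m digits
            digits.append(key % m)
            key //= m
        return sum(ai * ki for ai, ki in zip(a, digits)) % m
    return h

random.seed(0)
h = make_hash(m=17, r=3)              # handles keys up to 17^4 - 1
print(h(12345) in range(17))          # -> True: always a valid slot
```

Each call to make_hash draws a fresh random function from H, which is exactly the "choose the hash function at random" idea above.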
One method of Construction
• How big is H = {h_a}?
– |H| = m^(r+1).
• Theorem. The set H = {h_a} is universal.
• Proof.
• Suppose that x = 〈x_0, x_1, …, x_r〉 and y = 〈y_0, y_1, …, y_r〉 are distinct keys.
• Thus, they differ in at least one digit position; without loss of generality, position 0.
• For how many h_a ∈ H do x and y collide?
• h_a(x) = h_a(y), which implies that
Σ_{i=0}^{r} a_i·x_i ≡ Σ_{i=0}^{r} a_i·y_i (mod m).
• Equivalently, we have
Σ_{i=0}^{r} a_i·(x_i – y_i) ≡ 0 (mod m), that is,
a_0·(x_0 – y_0) ≡ –Σ_{i=1}^{r} a_i·(x_i – y_i) (mod m).
Fact from number theory
• Theorem. Let m be prime. For any z ∈ Z_m such that z ≠ 0, there exists a unique z^(–1) ∈ Z_m such that z·z^(–1) ≡ 1 (mod m).
• Example: m = 7, z = 3; then z^(–1) = 5, since 3·5 = 15 ≡ 1 (mod 7).
Back to the proof
• We just derived a_0·(x_0 – y_0) ≡ –Σ_{i=1}^{r} a_i·(x_i – y_i) (mod m), and since x_0 ≠ y_0, an inverse (x_0 – y_0)^(–1) must exist, which implies that
a_0 ≡ (–Σ_{i=1}^{r} a_i·(x_i – y_i))·(x_0 – y_0)^(–1) (mod m).
• Thus, for any choice of a_1, a_2, …, a_r, exactly one choice of a_0 causes x and y to collide.
Proof
• How many h_a will cause x and y to collide?
– There are m choices for each of a_1, a_2, …, a_r, but once these are chosen, exactly one choice of a_0 causes x and y to collide.
• Thus, the number of h_a that cause x and y to collide is m^r·1 = m^r = |H|/m, so H is universal.
Perfect hashing
• Requirement: given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.
• IDEA: a two-level scheme with universal hashing at both levels, and no collisions at level 2!
Example of Perfect hashing
Collisions at level 2
• Theorem. Let H be a class of universal hash functions for a table of size m = n^2. If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2.
• Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n^2. Since there are C(n, 2) = n(n–1)/2 pairs of keys that can possibly collide, the expected number of collisions is at most (n(n–1)/2)·(1/n^2) < 1/2.
Another fact from number theory
• Markov’s inequality says that for any non negative random variable X, we have Pr{X ≥ t} ≤ E[X]/t. • Theorem. The probability of no collisions is at least 1/2. • Proof. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2. • Conclusion: Just by testing random hash functions in H, we’ll quickly find one that works.
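The conclusion can be sketched in Python. Note that the linear function (a·k + b) mod m below is a simple stand-in for a properly universal family, and the key set is invented for illustration; the point is that with m = n^2 slots, a few random tries find a collision-free (perfect) function:

```python
import random

# Level-2 idea of perfect hashing: with table size m = n^2 and a random
# hash function, Pr[no collisions] >= 1/2, so a few random tries suffice.
def find_perfect_hash(keys, max_tries=100):
    m = len(keys) ** 2                   # quadratic space at level 2
    for _ in range(max_tries):
        a = random.randrange(1, m)
        b = random.randrange(m)
        h = lambda k, a=a, b=b: (a * k + b) % m   # stand-in hash family
        if len({h(k) for k in keys}) == len(keys):
            return h, m                  # collision-free: perfect for keys
    raise RuntimeError("no perfect hash found")

random.seed(1)
keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]      # n = 9, so m = 81
h, m = find_perfect_hash(keys)
print(len({h(k) for k in keys}) == len(keys))    # -> True
```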