Hash Functions


Introduction to Algorithms

Jiafen Liu Sept. 2013

Today’s Tasks

• Hashing
• Direct-access tables
• Choosing good hash functions
  – Division method
  – Multiplication method
• Resolving collisions by chaining
• Resolving collisions by open addressing

Symbol-Table Problem

• Hashing comes up in compilers, where it is known as the symbol-table problem.
• Suppose a table S holds n records, each with a key.
• Operations on S:
  – INSERT(S, x)
  – DELETE(S, x)
  – SEARCH(S, k)
• Dynamic set (supports INSERT and DELETE) vs. static set (SEARCH only).

The Simplest Case

• Suppose that the keys are drawn from the universe U ⊆ {0, 1, …, m–1}, and that all keys are distinct.
• Direct-access table: set up an array T[0 . . m–1] with T[k] = x if x ∈ S and key[x] = k, and T[k] = NIL otherwise.

• In the worst case, all 3 operations take Θ(1) time.
• Limitations of a direct-access table?

– The range of keys can be large: 64-bit numbers, or character strings (difficult to use directly as array indices).

• Hashing: try to keep the table small, while preserving the constant expected running time of the operations.

Naïve Hashing

• Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1} .

[Figure: a hash function h maps keys k1, k2, k3, k4, k5 into the table T[0 . . m–1]; k2 and k5 collide, since h(k2) = h(k5).]

Collisions

• When a record to be inserted maps to an already occupied slot in T, a collision occurs.

• The simplest way to resolve a collision?

– Link records in the same slot into a list.

[Figure: keys 49, 86, and 52 all hash to the same slot i, so h(49) = h(86) = h(52) = i, and the three records are linked into one chain.]
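The chaining idea above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code; the table size m = 7 and the sample keys 49, 86, 52 are taken from the slides, though with this particular m they land in different slots rather than one shared chain.

```python
# A minimal chaining hash table: a list of m buckets, where each
# bucket is a Python list holding every key hashed to that slot.

class ChainingHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return k % self.m                     # division-method hash

    def insert(self, k):
        self.slots[self._h(k)].append(k)      # worst case Theta(1)

    def search(self, k):
        return k in self.slots[self._h(k)]    # Theta(1 + chain length)

    def delete(self, k):
        self.slots[self._h(k)].remove(k)

t = ChainingHashTable(7)
for k in (49, 86, 52):
    t.insert(k)
```

If every key hashed to the same slot, all inserts would append to one chain and `search` would degrade to a linear scan, which is exactly the worst case discussed next.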

Worst Case of Chaining

• What’s the worst case of chaining?

– Every key hashes to the same slot, so the table degenerates into a single linked list.

• Access Time in the worst case?

– Θ(n) if we assume the size of S is n.

Average Case of Chaining

• In order to analyze the average case – we would need to know all possible inputs and their probabilities. – We don't know the distribution exactly, so we make assumptions.

• Here, we make the assumption of simple uniform hashing:

– Each key k in S is equally likely to be hashed to any slot in T, independent of where other keys are hashed.

• Simple uniform hashing includes an independence assumption.

Average Case of Chaining

• Let n be the number of keys in the table, and let m be the number of slots.

• Under the simple-uniform-hashing assumption, what is the probability that two given keys hash to the same slot?

– 1/m.

• Define the load factor of T to be α = n/m. What does that mean?

– The average number of keys per slot.

Search Cost

• The expected time for an unsuccessful search for a record with a given key is?

– Θ(1 + α): Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the chain.

• If α = O(1), the expected search time is Θ(1).
• How about a successful search?

– It has same asymptotic bound. – Reserved for your homework.

Choosing a hash function

• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.

– A good hash function should distribute the keys uniformly into all the slots.

– Regularity of the key distribution should not affect this uniformity.

• For example, all the keys are even numbers.

• The simplest way to distribute keys to m slots evenly?

Division Method

• Assume all keys are integers, and define h(k) = k mod m .

• Advantage: simple, and it usually works well in practice.

• Caution : – Be careful about choice of modulus m . – It doesn't work well for every size m of table.

• Example: what if we pick an m with a small divisor d?

Deficiency of Division Method

• Deficiency: suppose we pick an m with a small divisor d.

– Example: d = 2, so that m is an even number.

– Suppose it happens that all the keys are even.

– What happens to the hash table?

– We will never hash anything to an odd numbered slot.

Deficiency of Division Method

• Extreme deficiency: if m = 2^r, then all of m's prime factors are small (they are all 2).
• If k = (1011000111011010)₂ and m = 2^6, what does the hash value turn out to be?

– h(k) = k mod 2^6 keeps only the low-order 6 bits: (011010)₂.

• The hash value doesn't depend evenly on all the bits of k.

• Suppose all keys share the same low-order bits and differ only in their high-order bits: then every key hashes to the same slot.

How to choose modulus?

• Heuristics for choosing modulus m: – Choose m to be a prime – Make m not close to a power of two or ten.

• The division method is not a really good one:

– Sometimes making the table size a prime is inconvenient; we often want to create a table of size 2^r.

– Also, division takes more time to compute than multiplication or addition on most computers.
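The even-divisor deficiency above is easy to demonstrate. The following sketch (my own illustration, with made-up keys) hashes a set of all-even keys with an even table size and then with a prime one:

```python
# Division-method hash: h(k) = k mod m.
def h(k, m):
    return k % m

even_keys = [2, 4, 18, 36, 100]

# With an even table size m = 8, all-even keys occupy only even slots:
slots_even_m = {h(k, 8) for k in even_keys}
print(slots_even_m)            # only even slot numbers appear

# With a prime m = 7, the same keys spread across odd slots too:
slots_prime_m = {h(k, 7) for k in even_keys}
print(slots_prime_m)
```

Every odd-numbered slot is wasted in the first case, halving the effective table size, which is why the heuristics recommend a prime m.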

Another method —Multiplication

• Multiplication method is a little more complicated but superior.

• Assume that all keys are integers, m = 2 r , and our computer has w -bit words. • Define h(k) = (A·k mod 2 w ) rsh (w –r): – A is an odd integer in the range 2 w –1 < A< 2 w – (Both the highest bit and the lowest bit are 1) .

– rsh is the “bitwise right-shift” operator .

• Multiplication modulo 2 w is fast compared to division, and the rsh operator is fast.

• Tips: Don’t pick A too close to 2^(w–1) or 2^w.

Example of multiplication method

• Suppose that m = 8 = 2^3, r = 3, and that our computer has w = 7-bit words.
• We choose A = (1011001)₂ and k = (1101011)₂.
• Then A·k = (10010100110011)₂: the high 7 bits (1001010) are ignored by mod 2^7, the low w – r = 4 bits (0011) are ignored by rsh, and the middle r = 3 bits give h(k) = (011)₂.
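The slide's example can be checked directly. This sketch implements the definition h(k) = (A·k mod 2^w) rsh (w – r) and runs it on the slide's numbers (A = 1011001₂ = 89, k = 1101011₂ = 107):

```python
# Multiplication-method hash following the slide's definition:
# h(k) = (A*k mod 2^w) >> (w - r), with m = 2^r slots and w-bit words.
def mult_hash(k, A, w, r):
    return ((A * k) % (1 << w)) >> (w - r)

# Slide's example: m = 8 = 2^3, r = 3, w = 7,
# A = 0b1011001 = 89 (odd, highest and lowest bits set), k = 0b1101011 = 107.
h = mult_hash(107, 89, 7, 3)
print(h)   # the middle r bits of the product A*k, i.e. (011)_2 = 3
```

Both `% (1 << w)` and `>>` compile to cheap bit operations, which is the method's speed advantage over division.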

Another way to resolve collisions

• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.

• There's another way, open addressing, whose idea is: no storage for links.

• We should systematically probe the table until an empty slot is found.

Open Addressing

• The hash function depends on both the key and the probe number:

h : U × {0, 1, …, m–1} → {0, 1, …, m–1}
(universe of keys × probe number → slot number)

• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.

Implementation of Insertion

• What about HASH-SEARCH(T,k)?

Implementation of Searching
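The insertion and searching slides' pseudocode did not survive in this transcript. The following is a sketch of both routines, using linear probing as an illustrative probe sequence (the probe function and the slot values are my own choices, not the lecture's):

```python
# Open-addressing insert and search. NIL marks an empty slot; the
# probe sequence here is linear probing, h(k, i) = (k + i) mod m.

NIL = None

def probe(k, i, m):
    return (k + i) % m            # linear probing, for illustration

def hash_insert(T, k):
    m = len(T)
    for i in range(m):            # try each probe in sequence
        j = probe(k, i, m)
        if T[j] is NIL:
            T[j] = k
            return j
    raise RuntimeError("hash table overflow")

def hash_search(T, k):
    m = len(T)
    for i in range(m):
        j = probe(k, i, m)
        if T[j] is NIL:           # empty slot: k cannot be in T
            return None
        if T[j] == k:
            return j              # found k at slot j
    return None

T = [NIL] * 8
for k in (3, 11, 5):              # 3 and 11 both map to slot 3 first
    hash_insert(T, k)
print(hash_search(T, 11))         # found at slot 4 after one extra probe
```

Searching stops at the first truly empty slot, which is exactly why naive deletion (discussed next) breaks the search invariant.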

More about Open Addressing

• The hash table may fill up.

– We must have the number of elements less than or equal to the table size.

• Deletion is difficult, why?

– Suppose we remove a key from the table, and later somebody searches for another element. – The probe sequence they use happens to pass through the slot of the deleted key. – They find an empty slot, and wrongly conclude that the key they are looking for is not in the table. • So we should mark deleted slots with a special DELETED value rather than emptying them.

Example of open addressing


Some heuristics about probe

• We can record, globally, the largest number of probes any insertion has ever needed. – A search never has to look further than that number of probes.

• There are lots of ideas about forming a probe sequence effectively. • The simplest one is ?

– linear probing.

The simplest probing strategy

• Linear probing: given an ordinary hash function h(k), linear probing uses h(k, i) = (h(k) + i) mod m. • Advantage: simple. • Disadvantage?

– primary clustering

Primary Clustering

• It suffers from primary clustering , where regions of the hash table get full.

– Anything that hashes into that region has to look through all the stuff.

– What’s more, long runs of occupied slots build up, increasing the average search time.

Another probing strategy

• Double hashing: given two ordinary hash functions h1(k) and h2(k), double hashing uses h(k, i) = (h1(k) + i·h2(k)) mod m. • If h2(k) is relatively prime to m, double hashing generally produces excellent results.

– A common choice: make m a power of 2 and design h2(k) to produce only odd numbers.

Analysis of open addressing

• We make the assumption of uniform hashing:

– Each key is equally likely to have any one of the m! permutations of {0, 1, …, m–1} as its probe sequence, independent of other keys.

• Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1–α).

Proof of the theorem

Proof:
• At least one probe is always necessary.
• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.
• With probability (n–2)/(m–2), the third probe hits an occupied slot, etc.
• And then how to prove?
• Observe that (n–i)/(m–i) < n/m = α for i = 1, 2, …, n, since n < m.

Proof of the theorem

• Therefore, the expected number of probes is

1 + (n/m)(1 + ((n–1)/(m–1))(1 + ((n–2)/(m–2))(⋯)))
≤ 1 + α(1 + α(1 + α(⋯)))
= 1 + α + α² + α³ + ⋯
= 1/(1–α)    (geometric series)

Implications of the theorem

• If α is constant, then accessing an open addressed hash table takes constant time.

• If the table is half full, then the expected number of probes is ?

– 1/(1–0.5) = 2 .

• If the table is 90% full, then the expected number of probes is ?

– 1/(1–0.9) = 10 .

• As the table approaches full utilization, hashing becomes slow: high space utilization is paid for in probes.

Still Hashing

• Universal hashing • Perfect hashing

A weakness of hashing

• Problem: for any fixed hash function h, there exists a bad set of keys that all hash to the same slot. – This causes the average access time of the hash table to skyrocket.

– An adversary can pick all keys from {k : h(k) = i} for some slot i.

• IDEA: Choose the hash function at random , independently of the keys.

Universal hashing

• Definition: let U be the universe of keys, and let H be a finite collection of hash functions, each mapping U to {0, 1, …, m–1}. We say H is universal if for all x, y ∈ U with x ≠ y, the number of functions h ∈ H for which h(x) = h(y) is |H|/m.
• Equivalently: if h is chosen at random from H, then Pr{h(x) = h(y)} = 1/m.

Universality is good

• Theorem : • Let h be a hash function chosen at random from a universal set H of hash functions. • Suppose h is used to hash n arbitrary keys into the m slots of a table T.

• Then for a given key x, we have: E[number of collisions with x] < n/m .

Universality theorem

• Proof. Let C_x be the random variable denoting the total number of collisions of keys in T with x, and let c_xy = 1 if h(x) = h(y), and c_xy = 0 otherwise.
• Note that E[c_xy] = 1/m by universality, and C_x = Σ_{y ∈ T–{x}} c_xy.
• Therefore E[C_x] = Σ_{y ∈ T–{x}} E[c_xy] = (n–1)/m < n/m.

Construction universal hash function set

• One method to construct a set of universal hash functions:
• Let m be prime. Decompose key k into r+1 digits, each with value in the set {0, 1, …, m–1}.

– That is, let k = 〈k_0, k_1, …, k_r〉, where 0 ≤ k_i < m.

• Pick a = 〈a_0, a_1, …, a_r〉, where each a_i is chosen randomly from {0, 1, …, m–1}, and define

h_a(k) = (Σ_{i=0}^{r} a_i·k_i) mod m.
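The dot-product construction can be sketched directly. This is my own minimal implementation of the family described above, with an arbitrarily chosen prime m = 97 and r = 2 (so it handles keys up to m³ – 1):

```python
import random

# Dot-product universal family: m prime, key split into r+1 base-m
# digits, h_a(k) = sum(a_i * k_i) mod m for a randomly chosen vector a.
m = 97                                     # a prime table size

def digits(k, r):
    """Decompose k into its r+1 base-m digits k_0 .. k_r."""
    return [(k // m**i) % m for i in range(r + 1)]

def make_hash(r):
    # picking a at random chooses one h_a from the family H
    a = [random.randrange(m) for _ in range(r + 1)]
    def h(k):
        return sum(ai * ki for ai, ki in zip(a, digits(k, r))) % m
    return h

h = make_hash(r=2)                         # keys up to 97**3 - 1
print(h(123456))                           # some slot in 0..96
```

Note that the randomness is in the one-time choice of a, not in each call: a given h_a is deterministic, as a hash function must be.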

One method of Construction

• How big is H = {h_a}?

– |H| = m^(r+1).

• Theorem. The set H = {h_a} is universal.
• Proof.
• Suppose that x = 〈x_0, x_1, …, x_r〉 and y = 〈y_0, y_1, …, y_r〉 are distinct keys.
• Thus, they differ in at least one digit position.
• Without loss of generality, say position 0.
• For how many h_a ∈ H do x and y collide?

One method of Construction

• h_a(x) = h_a(y) implies that

Σ_{i=0}^{r} a_i·x_i ≡ Σ_{i=0}^{r} a_i·y_i   (mod m).

• Equivalently, we have

Σ_{i=0}^{r} a_i·(x_i – y_i) ≡ 0   (mod m),

or

a_0·(x_0 – y_0) ≡ –Σ_{i=1}^{r} a_i·(x_i – y_i)   (mod m).

Fact from number theory

• Let m be prime. For any z ∈ Z_m with z ≢ 0 (mod m), there exists a unique multiplicative inverse z⁻¹ ∈ Z_m such that z·z⁻¹ ≡ 1 (mod m).
• In other words, Z_m is a field when m is prime, so we can "divide" by any nonzero value.

Back to the proof

• We just derived

a_0·(x_0 – y_0) ≡ –Σ_{i=1}^{r} a_i·(x_i – y_i)   (mod m),

and since x_0 ≠ y_0, an inverse (x_0 – y_0)⁻¹ must exist, which implies that

a_0 ≡ (–Σ_{i=1}^{r} a_i·(x_i – y_i)) · (x_0 – y_0)⁻¹   (mod m).

• Thus, for any choices of a_1, a_2, …, a_r, exactly one choice of a_0 causes x and y to collide.

Proof

• How many h a will cause x and y to collide?

– There are m choices for each of a_1, a_2, …, a_r, but once these are chosen, exactly one choice of a_0 causes x and y to collide. • Thus, the number of h_a that cause x and y to collide is m^r · 1 = m^r = |H|/m.

Perfect hashing

• Requirement: given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.
• IDEA: a two-level scheme with universal hashing at both levels. No collisions at level 2!

Example of Perfect hashing

Collisions at level 2

• Theorem. Let H be a class of universal hash functions for a table of size m = n². If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2. • Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n². Since there are C(n, 2) = n(n–1)/2 pairs of keys that can possibly collide, the expected number of collisions is

n(n–1)/2 · 1/n² < 1/2.

Another fact from number theory

• Markov’s inequality says that for any nonnegative random variable X, we have Pr{X ≥ t} ≤ E[X]/t. • Theorem. The probability of zero collisions is at least 1/2. • Proof. Applying Markov’s inequality with t = 1, the probability of 1 or more collisions is at most E[X]/1 ≤ 1/2. • Conclusion: just by testing random hash functions in H, we’ll quickly find one that works.
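The conclusion above suggests a simple procedure: hash into m = n² slots with a random universal function and retry on collision. The sketch below does this; for brevity it uses the (a·k + b mod p) mod m universal family rather than the lecture's dot-product construction, and the prime p and the sample keys are my own illustrative choices:

```python
import random

# With m = n^2 slots, a random universal hash function is collision-free
# with probability >= 1/2, so a few retries quickly find a perfect one.
def find_perfect_hash(keys, p=10**9 + 7):
    n = len(keys)
    m = n * n                             # quadratic space, zero collisions
    while True:
        # random function from the family h(k) = ((a*k + b) mod p) mod m
        a = random.randrange(1, p)
        b = random.randrange(p)
        slots = [((a * k + b) % p) % m for k in keys]
        if len(set(slots)) == n:          # all slots distinct: perfect
            return a, b, m

keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
a, b, m = find_perfect_hash(keys)
print(m)    # 81 slots for 9 keys
```

By the theorem, each attempt succeeds with probability at least 1/2, so the expected number of retries is at most 2; this quadratic-space table is what the two-level scheme uses at level 2 for each bucket.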