Transcript: Hash Functions
Introduction to Algorithms
Jiafen Liu Sept. 2013
Today’s Tasks
Hashing
• Direct-access tables
• Choosing good hash functions
– Division method
– Multiplication method
• Resolving collisions by chaining
• Resolving collisions by open addressing
Symbol-Table Problem
• Hashing comes up in compilers as the symbol-table problem.
• Suppose: a table S holding n records.
• Operations on S:
– INSERT(S, x)
– DELETE(S, x)
– SEARCH(S, k)
• Dynamic set vs. static set
The Simplest Case
• Suppose that the keys are drawn from the universe U ⊆ {0, 1, …, m–1}, and that the keys are distinct.
• Direct-access table: set up an array T[0..m–1] with T[k] = x if x ∈ S and key[x] = k, and T[k] = NIL otherwise.
• In the worst case, each of the 3 operations takes Θ(1) time.
• Limitations of a direct-access table?
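As a concrete illustration, here is a minimal Python sketch of a direct-access table (the class name and sizes are illustrative, not from the lecture):

```python
# A direct-access table sketch: keys are assumed to be distinct
# integers drawn from the universe {0, 1, ..., m-1}.

class DirectAccessTable:
    def __init__(self, m):
        self.slots = [None] * m      # T[0..m-1], all initially NIL

    def insert(self, key, record):   # Θ(1)
        self.slots[key] = record

    def delete(self, key):           # Θ(1)
        self.slots[key] = None

    def search(self, key):           # Θ(1); None means "not present"
        return self.slots[key]

t = DirectAccessTable(16)
t.insert(7, "record-7")
print(t.search(7))   # -> record-7
print(t.search(3))   # -> None
```

All three operations are single array accesses, which is where the Θ(1) worst-case bound comes from.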
– The range of keys can be large: 64-bit numbers, or character strings (difficult to use directly as array indices).
• Hashing: try to keep the table small while preserving the benefit of constant expected running time.
Naïve Hashing
• Solution: Use a hash function h to map the keys of records in S into {0, 1, …, m–1} .
(Figure: a hash function h maps keys k1, …, k5 into the table T[0..m–1]; here h(k2) = h(k5), so two keys land in the same slot.)
Collisions
• When a record to be inserted maps to an already occupied slot in T, a collision occurs.
• The simplest way to resolve a collision?
– Link records in the same slot into a list.
(Figure: records 49, 86, and 52 all hash to the same slot i, i.e. h(49) = h(86) = h(52) = i, and are linked together in one chain.)
Worst Case of Chaining
• What’s the worst case of chaining?
– Each key hashes to the same slot, and the table degenerates into a single linked list.
• Access Time in the worst case?
– Θ(n) if we assume the size of S is n.
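A sketch of collision resolution by chaining (illustrative code, not from the lecture): each slot of T holds a chain of records, here modeled by a Python list, and the keys below are chosen so that they all collide, demonstrating the worst case.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # division-method hash

    def insert(self, key, value):             # Θ(1): prepend to the chain
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):                    # Θ(1 + α) expected, Θ(n) worst
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

t = ChainedHashTable(7)
for k in (3, 10, 17):     # all ≡ 3 (mod 7): worst case, one long chain
    t.insert(k, str(k))
print(t.search(17))       # -> 17  (found after walking the chain)
```

With all n keys in one chain, a search walks a list of length n, which is the Θ(n) worst case mentioned above.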
Average Case of Chaining
• In order to analyze the average case:
– We would need to know all possible inputs and their probabilities.
– We don't know the exact distribution, so we make assumptions.
• Here, we make the assumption of simple uniform hashing:
– Each key k in S is equally likely to be hashed to any slot in T, independent of where other keys hash.
• Simple uniform hashing includes an independence assumption.
Average Case of Chaining
• Let n be the number of keys in the table, and let m be the number of slots.
• Under the simple uniform hashing assumption, what is the probability that two keys hash to the same slot?
– 1/m.
• Define the load factor of T to be α = n/m. What does that mean?
– The average number of keys per slot.
Search Cost
• The expected time for an unsuccessful search for a record with a given key is?
– Θ(1 + α): Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the list.
• If α = O(1), the expected search time is Θ(1).
• How about a successful search?
– It has the same asymptotic bound.
– The proof is left for your homework.
Choosing a hash function
• The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice.
– A good hash function should distribute the keys uniformly into all the slots.
– Regularity of the key distribution should not affect this uniformity.
• For example, all the keys are even numbers.
• What is the simplest way to distribute keys evenly among m slots?
Division Method
• Assume all keys are integers, and define h(k) = k mod m .
• Advantage: simple and usually practical.
• Caution:
– Be careful about the choice of modulus m.
– It does not work well for every table size m.
• Example: what if we pick m with a small divisor d?
Deficiency of Division Method
• Deficiency: we pick m with a small divisor d.
– Example: d = 2, so that m is an even number.
– Suppose that all the keys happen to be even.
– What happens to the hash table?
– We will never hash anything to an odd-numbered slot: half the table is wasted.
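The deficiency is easy to see in a few lines of Python (the keys and the table size are chosen for illustration):

```python
# Division-method deficiency: with an even modulus m and all-even keys,
# only even-numbered slots are ever used.
m = 8                                   # even table size (a bad choice here)
keys = [2, 4, 18, 36, 100, 256]         # it happens that all keys are even
slots = {k % m for k in keys}
print(sorted(slots))                    # -> [0, 2, 4]: only even slots
assert all(s % 2 == 0 for s in slots)   # odd slots are never hit
```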
Deficiency of Division Method
• Extreme deficiency: if m = 2^r, then all of m's divisors are small.
• If k = (1011000111011010)_2 and m = 2^6, what does the hash value turn out to be?
– h(k) = (011010)_2: just the low-order 6 bits of k.
• The hash value doesn't depend evenly on all the bits of k.
• Suppose all the low-order bits are the same and all the high-order bits differ: every key hashes to the same slot.
How to choose modulus?
• Heuristics for choosing the modulus m:
– Choose m to be a prime.
– Make m not too close to a power of two or ten.
• The division method is not really a good one:
– Sometimes making the table size a prime is inconvenient; we often want a table of size 2^r.
– Division also takes more time to compute than multiplication or addition on most computers.
Another method: multiplication
• Multiplication method is a little more complicated but superior.
• Assume that all keys are integers, m = 2^r, and our computer has w-bit words.
• Define h(k) = (A·k mod 2^w) rsh (w – r):
– A is an odd integer in the range 2^(w–1) < A < 2^w
– (both the highest bit and the lowest bit are 1).
– rsh is the “bitwise right-shift” operator .
• Multiplication modulo 2 w is fast compared to division, and the rsh operator is fast.
• Tip: don't pick A too close to 2^(w–1) or 2^w.
Example of multiplication method
• Suppose that m = 8 = 2^3 (so r = 3), and that our computer has w = 7-bit words.
• We chose A = (1011001)_2 = 89 and k = (1101011)_2 = 107.
• A·k = (10010100110011)_2. The high bits are ignored by mod 2^7, the low w – r = 4 bits are ignored by rsh, and the middle r = 3 bits give h(k) = (011)_2 = 3.
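The example can be checked in Python; this sketch uses a bit-mask for mod 2^w and a right shift for rsh, mirroring the slide's operators:

```python
# Multiplication-method sketch reproducing the example above:
# w = 7-bit words, m = 8 = 2^3 (so r = 3), A = 1011001_2 = 89 (odd).
def mult_hash(k, A, w, r):
    # (A * k mod 2^w) right-shifted by (w - r): keep the top r bits
    # of the low-order w bits of the product.
    return ((A * k) & ((1 << w) - 1)) >> (w - r)

A, w, r = 0b1011001, 7, 3
k = 0b1101011                  # = 107
print(mult_hash(k, A, w, r))   # -> 3, matching the slide
```

Both operations compile to a multiply, a mask, and a shift, which is why this method is fast on real hardware.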
Another way to resolve collisions
• We’ve talked about resolving collisions by chaining. With chaining, we need an extra link field in each record.
• There's another way, open addressing, with the idea: no storage for links.
• We systematically probe the table until an empty slot is found.
Open Addressing
• The hash function depends on both the key and the probe number: h : U × {0, 1, …, m–1} → {0, 1, …, m–1} maps a (key, probe number) pair to a slot number.
• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉 should be a permutation of {0, 1, …, m–1}.
Implementation of Insertion
• What about HASH-SEARCH(T,k)?
Implementation of Searching
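The slide's pseudocode is not preserved in this transcript. The following Python sketch shows one possible shape of HASH-INSERT and HASH-SEARCH, using linear probing as a placeholder probe sequence (the class and method names are my own):

```python
# Open addressing: probe slots h(k,0), h(k,1), ... until an empty
# slot (insert) or the key / an empty slot (search) is found.
class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key, i):
        return (key + i) % self.m       # linear probing with h(k) = k mod m

    def insert(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is None:
                self.slots[j] = key
                return j                # slot where the key was stored
        raise OverflowError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is None:
                return None             # empty slot: key is not in the table
            if self.slots[j] == key:
                return j
        return None

t = OpenAddressTable(8)
for k in (10, 18, 26):     # all hash to slot 2; probing resolves the pile-up
    t.insert(k)
print(t.search(18))        # -> 3 (slot 2 was taken, probe moved on)
```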
More about Open Addressing
• The hash table may fill up.
– We must keep the number of elements n no greater than the table size m.
• Deletion is difficult, why?
– Suppose we remove a key from the table, and later somebody searches for another element.
– The probe sequence for that search happens to pass through the slot of the key we deleted.
– Finding an empty slot, the search concludes the key it is looking for is not in the table, even though it may lie further along the probe sequence.
• So we should keep deleted slots marked instead of emptying them.
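Here is a sketch of deletion with a special DELETED marker (a "tombstone"); the sentinel name and the linear probe are my own choices, not from the lecture:

```python
# Searches probe past tombstones; only a truly empty slot stops them.
DELETED = object()                       # sentinel distinct from any key

def delete(slots, m, key):
    for i in range(m):
        j = (key + i) % m                # same probe sequence as insert
        if slots[j] is None:
            return                       # key was never in the table
        if slots[j] == key:
            slots[j] = DELETED           # mark, don't empty, the slot
            return

def search(slots, m, key):
    for i in range(m):
        j = (key + i) % m
        if slots[j] is None:             # truly empty: stop
            return None
        if slots[j] == key:              # tombstones fail this test, so
            return j                     # the probe continues past them
    return None

slots = [None] * 8
for k in (10, 18, 26):                   # all probe from slot 2
    for i in range(8):
        j = (k + i) % 8
        if slots[j] is None or slots[j] is DELETED:
            slots[j] = k
            break

delete(slots, 8, 18)                     # tombstone in slot 3
print(search(slots, 8, 26))              # -> 4: still found past the tombstone
```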
Example of open addressing
Some heuristics about probe
• We can record, globally, the largest number of probes any insertion has needed.
– A search never needs to look further than that.
• There are lots of ideas about forming a probe sequence effectively. • The simplest one is ?
– linear probing.
The simplest probing strategy
• Linear probing: given an ordinary hash function h(k), linear probing uses h(k, i) = (h(k) + i) mod m.
• Advantage: simple.
• Disadvantage?
– primary clustering
Primary Clustering
• It suffers from primary clustering: long runs of occupied slots build up in regions of the table.
– Anything that hashes into such a region has to probe through all of it.
– What's more, long runs tend to grow even longer, increasing the average search time.
Another probing strategy
• Double hashing: given two ordinary hash functions h1(k) and h2(k), double hashing uses h(k, i) = (h1(k) + i·h2(k)) mod m.
• If h2(k) is relatively prime to m, double hashing generally produces excellent results.
– A common trick: make m a power of 2 and design h2(k) to produce only odd numbers.
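The "power of 2 plus odd h2" trick can be sketched as follows; the particular h1 and h2 below are illustrative choices, the point is only that an odd step is coprime to m = 2^r, so the probe sequence visits every slot:

```python
# Double hashing with m a power of 2 and h2 forced odd: the probe
# sequence (h1 + i*h2) mod m is then a permutation of {0, ..., m-1}.
def probe_sequence(key, m):
    h1 = key % m
    h2 = (key % (m - 1)) | 1            # forced odd, hence coprime to m = 2^r
    return [(h1 + i * h2) % m for i in range(m)]

seq = probe_sequence(12345, 8)
print(sorted(seq))                      # -> [0, 1, 2, 3, 4, 5, 6, 7]
```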
Analysis of open addressing
• We make the assumption of uniform hashing:
– Each key is equally likely to have any one of the m! permutations of {0, 1, …, m–1} as its probe sequence, independent of other keys.
• Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 – α).
Proof of the theorem
Proof:
• At least one probe is always necessary.
• With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe hits an occupied slot, and a third probe is necessary.
• With probability (n–2)/(m–2), the third probe hits an occupied slot, and so on.
• And then how do we finish the proof?
• Observe that (n–i)/(m–i) < n/m = α for i = 1, 2, …, n.
• Therefore, the expected number of probes is at most
1 + α(1 + α(1 + α(⋯))) ≤ 1 + α + α^2 + α^3 + ⋯ = 1/(1 – α) (geometric series).
Implications of the theorem
• If α is constant, then accessing an open addressed hash table takes constant time.
• If the table is half full, then the expected number of probes is ?
– 1/(1–0.5) = 2 .
• If the table is 90% full, then the expected number of probes is?
– 1/(1–0.9) = 10 .
• As the table approaches full utilization, hashing slows down dramatically.
Still Hashing
• Universal hashing • Perfect hashing
A weakness of hashing
• Problem: For any hash function h, there exists a bad set of keys that all hash to the same slot. – It causes the average access time of a hash table to skyrocket.
– An adversary can pick all keys from {k: h(k) = i } for some slot i.
• IDEA: Choose the hash function at random , independently of the keys.
Universal hashing
• Definition: a set H of hash functions mapping U to {0, 1, …, m–1} is universal if for each pair of distinct keys x, y ∈ U, the number of functions h ∈ H with h(x) = h(y) is |H|/m.
• Equivalently: with h chosen at random from H, Pr{h(x) = h(y)} = 1/m.
Universality is good
• Theorem : • Let h be a hash function chosen at random from a universal set H of hash functions. • Suppose h is used to hash n arbitrary keys into the m slots of a table T.
• Then for a given key x, we have: E[number of collisions with x] < n/m .
Universality theorem
• Proof. Let C_x be the random variable denoting the total number of collisions of keys in T with x, and let c_xy = 1 if h(x) = h(y), and 0 otherwise.
• Note: E[c_xy] = 1/m (by universality), and C_x = Σ_{y ∈ T–{x}} c_xy.
• Then E[C_x] = Σ_{y ∈ T–{x}} E[c_xy] = (n – 1)/m < n/m.
Constructing a universal hash function set
• One method to construct a universal set of hash functions:
• Let m be prime. Decompose key k into r+1 digits, k = 〈k_0, k_1, …, k_r〉, where each digit k_i has a value in the set {0, 1, …, m–1}.
• Pick a = 〈a_0, a_1, …, a_r〉, where each a_i is chosen randomly from {0, 1, …, m–1}.
• Define h_a(k) = (Σ_{i=0}^{r} a_i·k_i) mod m.
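This "dot-product" construction can be sketched in a few lines of Python (parameter values are illustrative):

```python
import random

# Dot-product universal family: m prime, keys split into r+1 base-m
# digits, h_a(k) = (sum of a_i * k_i) mod m for random digits a_i.
def make_hash(m, r):
    a = [random.randrange(m) for _ in range(r + 1)]   # random a = <a_0..a_r>
    def h(key):
        digits = []
        for _ in range(r + 1):        # decompose key into base-m digits
            digits.append(key % m)
            key //= m
        return sum(ai * ki for ai, ki in zip(a, digits)) % m
    return h

random.seed(0)
h = make_hash(m=17, r=3)              # handles keys up to 17^4 - 1
print(h(12345) in range(17))          # -> True: always a valid slot
```

Each call to make_hash draws a fresh random function from H, which is exactly the "choose the hash function at random" idea above.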
One method of Construction
• How big is H = {h_a}?
– |H| = m^(r+1).
• Theorem. The set H = {h_a} is universal.
• Proof.
• Suppose that x = 〈x_0, x_1, …, x_r〉 and y = 〈y_0, y_1, …, y_r〉 are distinct keys.
• Thus, they differ in at least one digit position; without loss of generality, position 0.
• For how many h_a ∈ H do x and y collide?
• h_a(x) = h_a(y), which implies that
Σ_{i=0}^{r} a_i·x_i ≡ Σ_{i=0}^{r} a_i·y_i (mod m).
• Equivalently, we have
Σ_{i=0}^{r} a_i·(x_i – y_i) ≡ 0 (mod m), that is,
a_0·(x_0 – y_0) ≡ –Σ_{i=1}^{r} a_i·(x_i – y_i) (mod m).
Fact from number theory
• Theorem. Let m be prime. For any z ∈ Z_m such that z ≠ 0, there exists a unique z^(–1) ∈ Z_m such that z·z^(–1) ≡ 1 (mod m).
• Example: m = 7, z = 3; then z^(–1) = 5, since 3·5 = 15 ≡ 1 (mod 7).
Back to the proof
• We just derived a_0·(x_0 – y_0) ≡ –Σ_{i=1}^{r} a_i·(x_i – y_i) (mod m), and since x_0 ≠ y_0, an inverse (x_0 – y_0)^(–1) must exist, which implies that
a_0 ≡ (–Σ_{i=1}^{r} a_i·(x_i – y_i))·(x_0 – y_0)^(–1) (mod m).
• Thus, for any choice of a_1, a_2, …, a_r, exactly one choice of a_0 causes x and y to collide.
Proof
• How many h_a will cause x and y to collide?
– There are m choices for each of a_1, a_2, …, a_r, but once these are chosen, exactly one choice of a_0 causes x and y to collide.
• Thus, the number of h_a that cause x and y to collide is m^r·1 = m^r = |H|/m, so H is universal.
Perfect hashing
• Requirement: given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.
• IDEA: a two-level scheme with universal hashing at both levels, and no collisions at level 2!
Example of Perfect hashing
Collisions at level 2
• Theorem. Let H be a class of universal hash functions for a table of size m = n^2. If we use a random h ∈ H to hash n keys into the table, the expected number of collisions is at most 1/2.
• Proof. By the definition of universality, the probability that two given keys collide under h is 1/m = 1/n^2. Since there are C(n, 2) = n(n–1)/2 pairs of keys that can possibly collide, the expected number of collisions is at most (n(n–1)/2)·(1/n^2) < 1/2.
Another fact from number theory
• Markov’s inequality says that for any non negative random variable X, we have Pr{X ≥ t} ≤ E[X]/t. • Theorem. The probability of no collisions is at least 1/2. • Proof. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2. • Conclusion: Just by testing random hash functions in H, we’ll quickly find one that works.
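The conclusion can be sketched in Python. Note that the linear function (a·k + b) mod m below is a simple stand-in for a properly universal family, and the key set is invented for illustration; the point is that with m = n^2 slots, a few random tries find a collision-free (perfect) function:

```python
import random

# Level-2 idea of perfect hashing: with table size m = n^2 and a random
# hash function, Pr[no collisions] >= 1/2, so a few random tries suffice.
def find_perfect_hash(keys, max_tries=100):
    m = len(keys) ** 2                   # quadratic space at level 2
    for _ in range(max_tries):
        a = random.randrange(1, m)
        b = random.randrange(m)
        h = lambda k, a=a, b=b: (a * k + b) % m   # stand-in hash family
        if len({h(k) for k in keys}) == len(keys):
            return h, m                  # collision-free: perfect for keys
    raise RuntimeError("no perfect hash found")

random.seed(1)
keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]      # n = 9, so m = 81
h, m = find_perfect_hash(keys)
print(len({h(k) for k in keys}) == len(keys))    # -> True
```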