Transcript Hashing

Hashing
Vishnu Kotrajaras, PhD
What do we want to do?
• Insert
• Delete
• Find (constant time)
• No sorting
• No findMin or findMax
Hash table
• We have a key and a value.
• The key is the argument of our hash function.
• The result of the hash function is the index at which we store the value.
• Therefore a hash function should:
– Be easy to calculate.
– Give different indices for different keys. This is difficult to achieve, but it can be done.
Hash function
• We use it to try to distribute values evenly throughout our table. We may use:
– key % tableSize, when the key is a number.
• But if tableSize is 10, 20, 30, … we cannot use this function.
– What if keys are Strings?
• Let's see some examples.
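The slide does not say why table sizes such as 10, 20, 30 are a problem. One plausible reading, assumed here, is that keys sharing a common factor with tableSize pile up in very few slots. The sketch below uses hypothetical keys (all multiples of 10) and a hypothetical helper slotsUsed to compare a table of size 10 with a prime size of 11.

```java
import java.util.HashSet;
import java.util.Set;

class ModuloHashDemo {
    // The numeric hash from the slide: key % tableSize.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    // Hypothetical helper: how many distinct slots do these keys occupy?
    static int slotsUsed(int[] keys, int tableSize) {
        Set<Integer> used = new HashSet<>();
        for (int k : keys)
            used.add(hash(k, tableSize));
        return used.size();
    }

    public static void main(String[] args) {
        int[] keys = {10, 20, 30, 40, 50, 60, 70, 80}; // assumed sample keys
        System.out.println(slotsUsed(keys, 10)); // every key lands in slot 0
        System.out.println(slotsUsed(keys, 11)); // keys spread over 8 slots
    }
}
```

With these keys, the size-10 table uses a single slot while the size-11 table uses eight, which is one reason later slides prefer a prime tableSize.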
Hash function (1st example)
• Sum the ASCII values of all characters in the key.
• public static int hash(String key, int tableSize){
    int hashVal = 0;
    for(int i = 0; i < key.length(); i++)
        hashVal += key.charAt(i);    // add the character code
    return hashVal % tableSize;
}
• The method on the previous page is not good if the table is large:
– What if each key is short (e.g. 8 characters)?
– An ASCII character has a maximum value of 127.
• Therefore the sum of all 8 characters will not exceed 127*8 = 1016.
– If the table is big, data will not be distributed evenly.
[Figure: a table drawn up to its 10,000th member. Indices will concentrate at the front.]
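A quick check of the bound above, as a sketch: the hash method is the first example from the slides, while the class name, the table size of 10007 and the sample keys are assumptions made for illustration.

```java
import java.util.Arrays;

class SumHashDemo {
    // First example from the slides: sum of character codes, modulo tableSize.
    static int hash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++)
            hashVal += key.charAt(i);
        return hashVal % tableSize;
    }

    // Hypothetical helper: the largest 8-character ASCII key (eight copies of code 127).
    static String maxAsciiKey() {
        char[] chars = new char[8];
        Arrays.fill(chars, (char) 127);
        return new String(chars);
    }

    public static void main(String[] args) {
        int tableSize = 10007; // assumed large table
        System.out.println(hash(maxAsciiKey(), tableSize)); // 127 * 8 = 1016
        System.out.println(hash("zzzzzzzz", tableSize));    // 122 * 8 = 976
    }
}
```

No 8-character ASCII key can reach an index above 1016, so roughly nine tenths of a 10007-slot table would stay empty.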
Hash function (2nd example)
• Assume we have a big table, and each key is made from at least 3 random letters.
• We look at the first 3 letters only.
public static int hash(String key, int tableSize){
    // 27 = number of letters, including space; 729 = 27*27
    return (key.charAt(0) + 27*key.charAt(1) + 729*key.charAt(2)) % tableSize;
}
• This distributes well in a table of size 10000. (10007 is the first prime after 10000, so we will use this number. You will see why.)
• Wait, any actual key will never be random like this:
– There will be a lot of repetition.
Hash function (3rd example)
• We calculate a polynomial function of 37, using Horner's Rule.
• We can calculate k0 + 37*k1 + 37*37*k2 by using [(k2*37) + k1]*37 + k0.
• Horner's rule repeats this step n times. In fact, it is a calculation of:
$$\sum_{i=0}^{KeySize-1} Key[KeySize-i-1] \cdot 37^{i}$$
public static int hash(String key, int tableSize){
    int hashVal = 0;
    for(int i = 0; i < key.length(); i++)
        hashVal = 37*hashVal + key.charAt(i);
    hashVal %= tableSize;
    if(hashVal < 0)              // possible overflow can leave hashVal negative
        hashVal += tableSize;
    return hashVal;
}
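To see that the loop really evaluates the polynomial, the sketch below compares it with a term-by-term version. hornerHash follows the slide. explicitHash is a hypothetical comparison helper that uses long arithmetic and is only meant for short keys.

```java
class HornerHashDemo {
    // Horner's rule as on the slide; int arithmetic may wrap around for long keys.
    static int hornerHash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++)
            hashVal = 37 * hashVal + key.charAt(i);
        hashVal %= tableSize;
        if (hashVal < 0)          // wrap-around can leave a negative remainder
            hashVal += tableSize;
        return hashVal;
    }

    // Hypothetical helper: the same polynomial, sum of key[n-1-i] * 37^i,
    // written out term by term. Valid only while the sum fits in a long.
    static int explicitHash(String key, int tableSize) {
        long sum = 0;
        long power = 1;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(key.length() - 1 - i) * power;
            power *= 37;
        }
        return (int) (sum % tableSize);
    }

    public static void main(String[] args) {
        // 97*37*37 + 98*37 + 99 = 136518, and 136518 % 10007 = 6427
        System.out.println(hornerHash("abc", 10007));
        System.out.println(explicitHash("abc", 10007));
    }
}
```

For a long key the int accumulator overflows, which is why the slide's version adds tableSize back when the remainder is negative.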
• May not be very well distributed, but it's easy to calculate.
• But if a key is long, the calculation will take some time.
– We solve it by not using every character.
– We may choose characters from important parts of the key.
• In any case, a hash function cannot guarantee that every item gets its own unique index.
• When 2 or more values fall in the same slot we
say it is a collision.
• How do we fix a collision?
Fixing collision: separate chaining
• Store the elements that hash to the same index in a linked list.
• To search for an element, apply the hash function, then search the list at that index.
• To insert an element:
– Use the hash function to find the list to put the element in.
– Check the list to see whether it already contains the element. If it does not, insert the element at the front.
– Statistically, a newly inserted element is often accessed again soon after the insertion.
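The steps above can be sketched in a few lines. This is not the course's table class: it is a minimal illustration that assumes java.util.LinkedList for the chains and String.hashCode() for the hash function, and the class name ChainedStringSet is hypothetical.

```java
import java.util.LinkedList;

class ChainedStringSet {
    private final LinkedList<String>[] lists;

    @SuppressWarnings("unchecked")
    ChainedStringSet(int tableSize) {
        lists = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++)
            lists[i] = new LinkedList<>();
    }

    // Pick the chain for a key (assumption: String.hashCode() as the hash function).
    private LinkedList<String> listFor(String key) {
        int h = key.hashCode() % lists.length;
        if (h < 0)
            h += lists.length;
        return lists[h];
    }

    // Search: hash first, then scan only that chain.
    boolean contains(String key) {
        return listFor(key).contains(key);
    }

    // Insert at the front unless the chain already holds the key.
    boolean insert(String key) {
        LinkedList<String> list = listFor(key);
        if (list.contains(key))
            return false;
        list.addFirst(key);
        return true;
    }

    // Remove the key from its chain, if present.
    boolean remove(String key) {
        return listFor(key).remove(key);
    }
}
```

A duplicate insert returns false and leaves the chain unchanged, matching the rule that an element already in the list is not inserted again.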
Code for an object that has a hash function.
public interface Hashable
{
    /**
     * Compute a hash function for this object.
     * @param tableSize the hash table size.
     * @return (deterministically) a number between
     * 0 and tableSize-1, distributed equitably.
     */
    int hash( int tableSize );
}
How we use a Hashable object.
public class Student implements Hashable{
    private String name;
    private double number;
    private int year;

    public int hash(int tableSize){
        // static method from our HashTable class
        return SeparateChainingHashTable.hash(name, tableSize);
    }

    public boolean equals(Object rhs){
        // students are equal when their names match; non-Student objects never match
        return rhs instanceof Student && name.equals(((Student)rhs).name);
    }
}
public class SeparateChainingHashTable
{
    /**
     * Construct the hash table.
     */
    public SeparateChainingHashTable( )
    {
        this( DEFAULT_TABLE_SIZE );
    }

    /**
     * Construct the hash table.
     * @param size approximate table size.
     */
    public SeparateChainingHashTable( int size )
    {
        theLists = new LinkedList[ nextPrime( size ) ];
        for( int i = 0; i < theLists.length; i++ )
            theLists[ i ] = new LinkedList( );
    }

    /**
     * Insert into the hash table. If the item is
     * already present, then do nothing.
     * @param x the item to insert.
     */
    public void insert( Hashable x )    // we use Student here
    {
        LinkedList whichList = theLists[ x.hash( theLists.length ) ];
        LinkedListItr itr = whichList.find( x );

        if( itr.isPastEnd( ) )
            whichList.insert( x, whichList.zeroth( ) );
    }

    /**
     * Remove from the hash table.
     * @param x the item to remove.
     */
    public void remove( Hashable x )
    {
        theLists[ x.hash( theLists.length ) ].remove( x );
    }

    /**
     * Find an item in the hash table.
     * @param x the item to search for.
     * @return the matching item, or null if not found.
     */
    public Hashable find( Hashable x )
    {
        return (Hashable)theLists[ x.hash( theLists.length ) ].find( x ).retrieve( );
    }

    /**
     * Make the hash table logically empty.
     */
    public void makeEmpty( )
    {
        for( int i = 0; i < theLists.length; i++ )
            theLists[ i ].makeEmpty( );
    }

    /**
     * A hash routine for String objects.
     * @param key the String to hash.
     * @param tableSize the size of the hash table.
     * @return the hash value.
     */
    public static int hash( String key, int tableSize )
    {
        int hashVal = 0;

        for( int i = 0; i < key.length( ); i++ )
            hashVal = 37 * hashVal + key.charAt( i );

        hashVal %= tableSize;
        if( hashVal < 0 )
            hashVal += tableSize;

        return hashVal;
    }

    private static final int DEFAULT_TABLE_SIZE = 101;

    /** The array of Lists. */
    private LinkedList [ ] theLists;

    /**
     * Internal method to find a prime number at least as large as n.
     * @param n the starting number (must be positive).
     * @return a prime number larger than or equal to n.
     */
    private static int nextPrime( int n )
    {
        if( n % 2 == 0 )
            n++;

        for( ; !isPrime( n ); n += 2 )
            ;

        return n;
    }

    /**
     * Internal method to test if a number is prime.
     * Not an efficient algorithm.
     * @param n the number to test.
     * @return the result of the test.
     */
    private static boolean isPrime( int n )
    {
        if( n == 2 || n == 3 )
            return true;

        if( n == 1 || n % 2 == 0 )
            return false;

        for( int i = 3; i * i <= n; i += 2 )
            if( n % i == 0 )
                return false;

        return true;
    }

    // Simple main
    public static void main( String [ ] args )
    {
        SeparateChainingHashTable H = new SeparateChainingHashTable( );

        final int NUMS = 4000;
        final int GAP = 37;

        System.out.println( "Checking... (no more output means success)" );

        for( int i = GAP; i != 0; i = ( i + GAP ) % NUMS )
            H.insert( new MyInteger( i ) );
        for( int i = 1; i < NUMS; i += 2 )
            H.remove( new MyInteger( i ) );

        for( int i = 2; i < NUMS; i += 2 )
            if( ((MyInteger)(H.find( new MyInteger( i ) ))).intValue( ) != i )
                System.out.println( "Find fails " + i );

        for( int i = 1; i < NUMS; i += 2 )
        {
            if( H.find( new MyInteger( i ) ) != null )
                System.out.println( "OOPS!!! " + i );
        }
    }
}
Definition
• Load factor (lambda): $\lambda = \dfrac{\text{numberOfElementsInTheTable}}{\text{tableSize}}$
• It is the average length of a linked list.
Search time = time to do hashing + time to search the list
= constant + time to search the list
• Unsuccessful search
– Search time is proportional to the average list length, which is the load factor.
• Successful search
– In the list that we search, there is one node that contains the object we want to find. There are other nodes too (0 or more).
– In a table with N members distributed into M lists:
• There are N-1 nodes that do not have what we want.
• If these nodes are distributed evenly among the lists, each list will have (N-1)/M of them.
• (N-1)/M = lambda - (1/M)
• This is approximately lambda, because M is large.
• On average, half of these other nodes will be searched before we find what we want. That is, lambda/2 steps will be executed.
• Therefore the average time to find the required element is 1 + (lambda/2) steps.
• The tableSize is not important. What really matters is the load factor.
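A worked instance of the estimate above, using hypothetical sizes N = 2000 and M = 1000 chosen only for illustration:

```latex
\lambda = \frac{N}{M} = \frac{2000}{1000} = 2
\qquad
\frac{N-1}{M} = \frac{1999}{1000} = 1.999 \approx \lambda
\qquad
1 + \frac{N-1}{2M} = 1 + \frac{1999}{2000} = 1.9995 \approx 1 + \frac{\lambda}{2} = 2
```

Doubling both N and M leaves lambda, and therefore both search estimates, essentially unchanged.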
Fixing collision by using open addressing
• No list.
• If there is a collision, keep calculating a new index until an empty slot is found.
– The indices tried are h_0(x), h_1(x), …
– h_i(x) = [hash(x) + f(i)] % tableSize, with f(0) = 0
• Every item must be stored in the table itself. Therefore the table must be large enough to spread the data out.
– Load factor <= 0.5
Open addressing: linear probing
• f is a linear function of i.
• Normally we have f(i) = i.
• It is "looking ahead one slot at a time."
– This may take time.
• Runs of consecutive filled slots will form, called primary clustering. If a new collision takes place inside such a run, it will take some time before we can find another empty slot.
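A minimal sketch of linear probing with f(i) = i, assuming non-negative int keys, hash(x) = x % tableSize, and -1 as the marker for an empty slot. The class name, the sample keys and the table size of 10 are illustrative assumptions.

```java
import java.util.Arrays;

class LinearProbingDemo {
    static final int EMPTY = -1; // assumed marker for an unused slot

    // Insert each key at the first free slot, looking ahead one slot at a time.
    // Assumes the table never becomes completely full.
    static int[] insertAll(int[] keys, int tableSize) {
        int[] table = new int[tableSize];
        Arrays.fill(table, EMPTY);
        for (int key : keys) {
            int pos = key % tableSize;
            while (table[pos] != EMPTY)
                pos = (pos + 1) % tableSize;
            table[pos] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = insertAll(new int[]{89, 18, 49, 58, 69}, 10);
        // prints [49, 58, 69, -1, -1, -1, -1, -1, 18, 89]
        System.out.println(Arrays.toString(table));
    }
}
```

Slots 8, 9, 0, 1 and 2 form one run (the table wraps around), so a later key that hashes anywhere in that run must walk to its end: this is primary clustering.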
Open addressing: quadratic probing
• There is no primary clustering with this method.
• We usually have f(i) = i^2.
• h_i(x) = [hash(x) + f(i)] % tableSize
• Example: a already occupies its slot.
– If b collides with a, we add 1^2 to find a new empty slot.
– If c also collides with a, adding 1^2 lands on b, so we need to go further by adding 2^2 instead.
• However, if our table is more than half full or the tableSize is not prime, this method does not guarantee an empty slot.
• But if the table is not yet half full and the tableSize is prime, it is proven that we can always find an empty slot for a new value.
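The a, b, c scenario can be replayed with a small sketch. It assumes non-negative int keys, hash(x) = x % tableSize, a prime table size of 7, and the hypothetical keys 7, 14 and 21, which all hash to index 0.

```java
import java.util.Arrays;

class QuadraticProbingDemo {
    static final int EMPTY = -1; // assumed marker for an unused slot

    // Insert each key at the first free slot among home, home + 1^2, home + 2^2, ...
    // Assumes a prime tableSize and a table that stays less than half full.
    static int[] insertAll(int[] keys, int tableSize) {
        int[] table = new int[tableSize];
        Arrays.fill(table, EMPTY);
        for (int key : keys) {
            int home = key % tableSize;
            int pos = home;
            for (int i = 1; table[pos] != EMPTY; i++)
                pos = (home + i * i) % tableSize;
            table[pos] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        // a = 7 takes slot 0, b = 14 moves 1^2 ahead, c = 21 moves 2^2 ahead.
        int[] table = insertAll(new int[]{7, 14, 21}, 7);
        // prints [7, 14, -1, -1, 21, -1, -1]
        System.out.println(Arrays.toString(table));
    }
}
```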
Proof
• Let the tableSize be a prime number greater than 3.
• Let $(h(x) + i^2) \bmod tableSize$ and $(h(x) + j^2) \bmod tableSize$ be two probe positions, with $0 \le i, j \le \lfloor tableSize/2 \rfloor$.
• Prove by contradiction:
– Assume both positions are the same and $i \ne j$.
$$h(x) + i^2 \equiv h(x) + j^2 \pmod{tableSize}$$
$$i^2 \equiv j^2 \pmod{tableSize}$$
$$i^2 - j^2 \equiv 0 \pmod{tableSize}$$
$$(i - j)(i + j) \equiv 0 \pmod{tableSize}$$
– Because tableSize is prime, it must divide $i - j$ or $i + j$.
• $i - j \equiv 0$ is impossible, because we assumed i and j are not equal and $|i - j| < tableSize$.
• $i + j \equiv 0$ is also impossible, because $0 \le i, j \le \lfloor tableSize/2 \rfloor$ with $i \ne j$ gives $0 < i + j < tableSize$.
• Therefore our assumption that the two positions are the same is wrong.
• Thus the two positions are always different.
• So there is always a slot for a new value, if the table is not yet half full and the tableSize is prime.
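Why prime?
• If not, the number of slots that probing can reach will be greatly reduced.
• Example: tableSize == 16. Assume the normal hashing gives index == 0 (quadratic probing).
– 1^2 and 7^2 both land on slot 1.
– 2^2 and 6^2 both land on slot 4.
– 3^2 and 5^2 both land on slot 9.
– 4^2 lands back on slot 0.
• You can see that the probes keep falling in the same positions.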
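The tableSize == 16 example can be checked by listing every slot that quadratic probing can reach from a given home index. The sketch below does that and, as an assumed comparison, repeats it for the prime size 17. The class and method names are hypothetical.

```java
import java.util.TreeSet;

class ProbeCoverageDemo {
    // Collect every slot of the form (home + i^2) % tableSize for i = 0 .. tableSize-1.
    static TreeSet<Integer> reachableSlots(int home, int tableSize) {
        TreeSet<Integer> slots = new TreeSet<>();
        for (int i = 0; i < tableSize; i++)
            slots.add((home + i * i) % tableSize);
        return slots;
    }

    public static void main(String[] args) {
        System.out.println(reachableSlots(0, 16)); // prints [0, 1, 4, 9]
        System.out.println(reachableSlots(0, 17).size()); // prints 9
    }
}
```

With 16 slots only 4 are ever probed, so an insert can fail while 12 slots are still free. With 17 slots, 9 distinct slots are reachable, which is more than half the table.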
We cannot use ordinary deletion.
• If we remove an element, then later attempt to find another value, we may encounter the empty space and think that the value is not there (in fact the value is in the table, but reaching it requires jumping past the collision point).
• Use lazy deletion: mark a deleted slot without actually removing its element.
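The difference between the two deletion styles can be shown with a toy table. This sketch uses linear probing only to keep it short, int keys with hash(x) = x % tableSize, and a hypothetical lazy flag that switches between ordinary and lazy deletion. None of these names come from the course code.

```java
class LazyDeletionDemo {
    private final Integer[] keys;   // null means the slot was never used (or was truly emptied)
    private final boolean[] active; // false means "deleted" under lazy deletion
    private final boolean lazy;

    LazyDeletionDemo(int tableSize, boolean lazy) {
        this.keys = new Integer[tableSize];
        this.active = new boolean[tableSize];
        this.lazy = lazy;
    }

    // Probe until we hit the key or an empty slot (assumes the table is never full).
    private int findPos(int key) {
        int pos = key % keys.length;
        while (keys[pos] != null && keys[pos].intValue() != key)
            pos = (pos + 1) % keys.length;
        return pos;
    }

    void insert(int key) {
        int pos = findPos(key);
        keys[pos] = key;
        active[pos] = true;
    }

    void remove(int key) {
        int pos = findPos(key);
        if (keys[pos] == null)
            return;
        if (lazy)
            active[pos] = false; // lazy deletion: keep the element, mark the slot
        else
            keys[pos] = null;    // ordinary deletion: leaves a hole in the probe path
    }

    boolean contains(int key) {
        int pos = findPos(key);
        return keys[pos] != null && active[pos];
    }
}
```

With a table of size 10, inserting 9 and then 19 puts 19 one slot past its home index. After removing 9, ordinary deletion makes contains(19) stop at the hole and report false, while lazy deletion still finds 19.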
Open addressing implementation
class HashEntry {
    Hashable element;   // the element
    boolean isActive;   // false means deleted

    public HashEntry( Hashable e ){
        this( e, true );
    }

    public HashEntry( Hashable e, boolean i ){
        element = e;
        isActive = i;
    }
}
public class QuadraticProbingHashTable{
    private static final int DEFAULT_TABLE_SIZE = 11;

    /** The array of elements. Each cell is null, active, or nonactive. */
    private HashEntry [ ] array;
    private int currentSize;    // The number of occupied cells

    public QuadraticProbingHashTable( ){
        this( DEFAULT_TABLE_SIZE );
    }

    /**
     * Construct the hash table.
     * @param size the approximate initial size.
     */
    public QuadraticProbingHashTable( int size ){
        allocateArray( size );
        makeEmpty( );
    }

    /**
     * Internal method to allocate array.
     * @param arraySize the size of the array.
     */
    private void allocateArray( int arraySize ){
        array = new HashEntry[ arraySize ];
    }

    /**
     * Make the hash table logically empty.
     */
    public void makeEmpty( ){
        currentSize = 0;
        for( int i = 0; i < array.length; i++ )
            array[ i ] = null;
    }

    /**
     * Return true if currentPos exists and is active.
     * @param currentPos the result of a call to findPos.
     * @return true if currentPos is active.
     */
    private boolean isActive( int currentPos ){
        return array[ currentPos ] != null && array[ currentPos ].isActive;
    }

    /**
     * Method that performs quadratic probing resolution.
     * Uses f(i) = i^2 = f(i-1) + 2i - 1, so each probe adds 2i - 1 to the previous position.
     * @param x the item to search for.
     * @return the position where the search terminates.
     */
    private int findPos( Hashable x ) {
        int collisionNum = 0;
        int currentPos = x.hash( array.length );

        while( array[ currentPos ] != null &&
               !array[ currentPos ].element.equals( x ) ){
            currentPos += 2 * ++collisionNum - 1;   // Compute ith probe
            if( currentPos >= array.length )        // Implement the mod
                currentPos -= array.length;
        }

        return currentPos;
    }

    /**
     * Find an item in the hash table.
     * @param x the item to search for.
     * @return the matching item.
     */
    public Hashable find( Hashable x ){
        int currentPos = findPos( x );
        return isActive( currentPos ) ? array[ currentPos ].element : null;
    }

    /**
     * Insert into the hash table. If the item is
     * already present, do nothing.
     * @param x the item to insert.
     */
    public void insert( Hashable x )
    {
        // Insert x as active
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            return;     // x is already inside, so do nothing

        array[ currentPos ] = new HashEntry( x, true );

        // Rehash; see Section 5.5
        if( ++currentSize > array.length / 2 )
            rehash( );
    }

    /**
     * Expand the hash table.
     * O(N) because there are N members to be rehashed.
     * This is not done often because the table has to be half filled first.
     */
    private void rehash( )
    {
        HashEntry [ ] oldArray = array;

        // Create a new double-sized, empty table
        allocateArray( nextPrime( 2 * oldArray.length ) );
        currentSize = 0;

        // Copy table over; insert recalculates each index because this is a new array
        for( int i = 0; i < oldArray.length; i++ )
            if( oldArray[ i ] != null && oldArray[ i ].isActive )
                insert( oldArray[ i ].element );

        return;
    }
Rehashing
• Rehashing can be triggered in 3 situations:
– Do it immediately when the table is half full.
– Do it when an insert starts to fail.
– Do it when the load factor reaches some chosen value (it does not have to be 0.5).
• Do not forget that the higher the load factor, the more difficult it is to insert.
    /**
     * Remove from the hash table.
     * @param x the item to remove.
     */
    public void remove( Hashable x )
    {
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            array[ currentPos ].isActive = false;
    }
hash, nextPrime, isPrime are the same as before.
Downside of quadratic probing
• Secondary clustering: keys that hash to the same index follow the same probe sequence.
• Fixed by double hashing:
– f(i) = i*hash2(x)
– We probe at hash2(x), 2*hash2(x), … and so on from the original index.
• Must be careful when choosing the function.
– If our array has 9 slots and hash2(x) = x%9, then inserting 99 gives hash2(99) = 0, so every probe lands on the same slot.
– hash2(x) must never give 0.
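Example of hash2
• Assume hash(x) = x%tableSize
• hash2(x) = R-(x%R), where R is prime and R < tableSize
• Let our tableSize be 16 and R be 13. We insert 9, 25, 26, 41, 42, 58 in that order.
– 9 goes to index 9.
– 25 collides (index 9), so we add 13-(25%13)=1 and it goes to index 10.
– 26 collides (index 10), so we add 13-(26%13)=13 and it goes to index 7.
– 41 collides (index 9), so we add 13-(41%13)=11 and it goes to index 4.
– 42 collides (index 10), so we add 13-(42%13)=10, but 42 still collides (index 4), so we add 2*10 from its original index and it goes to index 14.
– 58 collides (index 10), so we add 13-(58%13)=7 and it goes to index 1.
• Final table, by index: 1 → 58, 4 → 41, 7 → 26, 9 → 9, 10 → 25, 14 → 42.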
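The example can be replayed with a short sketch. It assumes non-negative int keys, hash(x) = x % tableSize, hash2(x) = R - (x % R) with R = 13, and -1 as the marker for an empty slot. The class and method names are hypothetical.

```java
import java.util.Arrays;

class DoubleHashingDemo {
    static final int EMPTY = -1; // assumed marker for an unused slot

    // Second hash function from the slide; never returns 0 because x % r < r.
    static int hash2(int x, int r) {
        return r - (x % r);
    }

    // Insert each key at the first free slot among home, home + step, home + 2*step, ...
    // Assumes the probe sequence reaches a free slot (true for the sample keys).
    static int[] insertAll(int[] keys, int tableSize, int r) {
        int[] table = new int[tableSize];
        Arrays.fill(table, EMPTY);
        for (int key : keys) {
            int home = key % tableSize;
            int step = hash2(key, r);
            int pos = home;
            for (int i = 1; table[pos] != EMPTY; i++)
                pos = (home + i * step) % tableSize;
            table[pos] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = insertAll(new int[]{9, 25, 26, 41, 42, 58}, 16, 13);
        // 58 ends at index 1, 41 at 4, 26 at 7, 9 at 9, 25 at 10, 42 at 14
        System.out.println(Arrays.toString(table));
    }
}
```

Note that 42 needs two probes: its first step of 10 lands on index 4, which 41 already occupies, so the second probe at 2*10 from its home index 10 places it at index 14.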