Transcript Slide 1

CSE331 - Hashing
Chapter 12: Advanced Associative Containers
Hash Functions
What works
C++ Function Objects
BST vs Hash Table
Hash Function
Hash Iterators
Collision
Efficiency of Hash Methods
Coping with Collisions
Open addressing & linear probing
Summary Slides
Chaining with separate lists
Main Index
Contents





BST vs Hash Table
Both used to implement Sets & Maps
Binary Search Tree – ordered associative container
O(log N) access (average & worst case, provided the tree is kept balanced)
Hash Table – unordered associative container
O(1) access (average case)

Hash Function
A hash function converts a key into a numeric
(unsigned int) table index

Ideal hash functions uniformly distribute keys
to all available indices

When two keys hash to the same index a
collision occurs

Keys are not in any particular order (numeric,
alphabetical, ...) within the table
Example Hash Function
hf(22) = 22    22 % 7 = 1    -> tableEntry[1]
hf(4) = 4      4 % 7 = 4     -> tableEntry[4]
(table of 7 entries, tableEntry[0] .. tableEntry[6])
Example Hash Function
hf(22) = 22    22 % 7 = 1    -> tableEntry[1]
hf(36) = 36    36 % 7 = 1    -> tableEntry[1]  (collides with 22)
hf(4) = 4      4 % 7 = 4     -> tableEntry[4]
(table of 7 entries, tableEntry[0] .. tableEntry[6])

Coping with Collisions
Three primary methods exist for coping with collisions
– Rehashing: use the same key but a different hash function
– Linear Probing: examine successive locations (index, index+1, index+2, ...)
– Chaining: implement the table with a separate list at each table[index] location
Note: Except for the last case, the table is a fixed size.
Hash Table Using Linear Probing – Open Addressing
Table of 11 entries. Each occupied entry shows the key followed by the number of probes needed to insert it.

Index   (a) Insert 54, 77,    (b) Insert 45    (c) Insert 35    (d) Insert 76
            94, 89, 14
0       77 (1)                77 (1)           77 (1)           77 (1)
1       89 (1)                89 (1)           89 (1)           89 (1)
2       -                     45 (2)           45 (2)           45 (2)
3       14 (1)                14 (1)           14 (1)           14 (1)
4       -                     -                35 (3)           35 (3)
5       -                     -                -                76 (7)
6       94 (1)                94 (1)           94 (1)           94 (1)
7       -                     -                -                -
8       -                     -                -                -
9       -                     -                -                -
10      54 (1)                54 (1)           54 (1)           54 (1)
Linear Probing PseudoCode
// insert item into table of size n using hashFunc() to
// calculate index. this assumes no duplicate keys, and some
// method of indicating that a hash table location is empty
int index = hashFunc(item) % n;
int origIndex = index;
do
{
    if table[index] is empty
        insert item as table[index] and return
    else if table[index] matches item
        return
    index = (index+1) % n;    // this is next location to probe
} while (index != origIndex);
// if we get here, table is full & does not contain item
throw overflowError;

Problems with Linear Probing
Clustering of items occurs & degrades
performance as number of items approaches
size of table
– Colliding items fill in gaps between other entries
– This forms runs or clusters within the table
– Items in the cluster are a mix of items that hash to
  different indices, so long sequences of repeated
  probes are required to find what we seek
Chaining – Uses Lists or Buckets
Implement the hash table as a vector of lists
– Each list (bucket, chain, ...) contains all items that
  hash to the associated table location
– Buckets are not mixed like clusters in linear probing
– Table size can grow easily by expanding individual
  buckets as necessary
– The number of buckets stays constant
– Within a bucket, items are unordered and must be
  searched linearly
Chaining with Separate Lists Example
(11 buckets; each entry shows the key followed by the number of comparisons needed to find it in its bucket)
< Bucket 0 >   77(1)
< Bucket 1 >   89(1)  45(2)
< Bucket 2 >   35(1)
< Bucket 3 >   14(1)
< Bucket 4 >
< Bucket 5 >
< Bucket 6 >   94(1)
< Bucket 7 >
< Bucket 8 >
< Bucket 9 >
< Bucket 10>   54(1)  76(2)



C++ Function Objects
A function object is an instance of a class that
defines the function-call operator, operator()
(often its only member function)
Function objects are easily passed as
parameters to other functions
Commonly used to implement hash functions
and comparison operations
template <typename T>
class greaterThan {
public:
    bool operator() (const T& x, const T& y) const
    { return x > y; }
};
Reasonable Hash Functions
Integer keys
– Identity function is good if key number is random or
  a portion of it is random
class hfIntKey {
public:
    // returns the hash value, so the return type is unsigned int
    unsigned int operator() (int key) const
    {
        return key;
    }
};
Reasonable Hash Functions
Integer keys
– Midsquare technique (extracts middle two bytes of 4
  byte square of key) -- works if key is not random
class hfMidSq {
public:
    unsigned int operator() (int key) const
    {
        unsigned int n = key;
        return ((n*n)/256) % 65536; // 0 .. 2^16-1
    }
};
Reasonable Hash Functions
String keys
– Simple function uses ASCII codes for the string characters to
  build an unsigned integer out of an n-character string
class hfString {
public:
    unsigned int operator() (const string& key) const
    {
        unsigned int prime = 2049982463;
        int n(0);
        // note: n can overflow for long keys
        for (string::size_type i=0; i < key.length(); i++)
            n = n*8 + key[i];
        return (n > 0 ? (n % prime) : (-n % prime) );
    }
};
Reasonable Hash Functions
String keys
– Folding uses substrings as numbers and combines them by
  addition or multiplication or …
– Example: use 3 character substrings of SSN
class hfSSN {
public:
    // needs <cstdlib> for atoi
    unsigned int operator() (const string& ssn) const
    {
        return ( atoi(ssn.substr(0,3).c_str())
               + atoi(ssn.substr(3,3).c_str())
               + atoi(ssn.substr(6,3).c_str()) );
    }
};

Hash Class – not in STL
See headers in textbook include folder
– d_hash.h – for the hash table using buckets
– d_hashf.h – for hash function object
– d_uset.h – for unordered set based on hash class
– d_hiter.h – for hash class iterator and const_iterator
Hash Class – Not in STL
template <typename T, typename HashFunc>
class hash
{
public:
    hash (int nbuckets,
          const HashFunc& hfunc = HashFunc());
    hash (T *first, T *last, int nbuckets,
          const HashFunc& hfunc = HashFunc());
    bool empty() const;
    int size() const;
    iterator find(const T& item);
    pair<iterator,bool> insert(const T& item);
    int erase(const T& item);
    void erase(iterator pos);
    void erase(iterator first, iterator last);
    iterator begin();
    const_iterator begin() const;
    iterator end();
    const_iterator end() const;
private:
    int numBuckets;            // number of buckets
    vector<list<T> > bucket;   // table is vector of lists
    HashFunc hf;               // hash function
    int hashtableSize;         // number of elements
};
Hash::find(item)
template <typename T, typename HashFunc>
typename hash<T,HashFunc>::iterator hash<T,HashFunc>::find(
        const T& item)
{
    int hashIndex = int(hf(item) % numBuckets);
    list<T>& myBucket = bucket[hashIndex];
    // typename is required because the iterator type depends on T
    typename list<T>::iterator bucketIter;
    // traverse list and look for a match with item
    bucketIter = myBucket.begin();
    while (bucketIter != myBucket.end())
    {
        if (*bucketIter == item)
            // return iterator to found item
            return iterator(this, hashIndex, bucketIter);
        bucketIter++;
    }
    // did not find item, so return iterator to table end
    return end();
}
Hash::insert(item)
template <typename T, typename HashFunc>
pair<typename hash<T, HashFunc>::iterator,bool>
hash<T, HashFunc>::insert(const T& item)
{
    int hashIndex = int(hf(item) % numBuckets);
    list<T>& myBucket = bucket[hashIndex];
    // typename is required because the iterator type depends on T
    typename list<T>::iterator bucketIter;
    bool success;
    bucketIter = myBucket.begin();
    while (bucketIter != myBucket.end())
    {
        if (*bucketIter == item)
            break; // found the item already in bucket
        else
            bucketIter++;
    }
    if (bucketIter == myBucket.end()) {
        // item not in bucket, so add it at the end
        bucketIter = myBucket.insert(bucketIter, item);
        success = true;
        hashtableSize++;
    }
    else
        success = false; // item already in table
    return pair<iterator,bool> (iterator(this,
                                         hashIndex, bucketIter),
                                success);
}
Hash Iterator hIter Referencing Element 22 in Table ht
hash<int, hFintID> ht;
hash<int, hFintID>::iterator hIter;
[Figure: table ht with hf(x) = x and five buckets, buckets[0] .. buckets[4], holding the values 10, 2, 21, 22 and 29. hIter stores hashTable = &ht, currentBucket = 2 and currentLoc, which refers to element 22 in that bucket's list.]
*hIter = 22.
Determining Performance
Load factor $\lambda = n/m$ (m = size of table, n = items in table)
– Measures table density
– Linear probing (m = size of vector, the maximum number of items)
– Chaining (m = number of buckets)
Worst case (all items hash to same table location or bucket)
– Linear search is O(n)
– Making table size prime helps prevent nonuniform
  distribution causing this worst case
Average Case - Chaining
Finding bucket is O(1) – using hash function
Uniform hashing implies each bucket has n/m items
Assuming uniform hash distribution
– The ith item was inserted at the end of its bucket when the
  previous (i-1) items were spread evenly over the m buckets
– To find this item takes 1+(i-1)/m comparisons since there are (on
  average) (i-1)/m items ahead of it in its bucket
Average performance of search for an arbitrary item is
the average of the number of comparisons required to
find each item in the list
$$\frac{1}{n}\sum_{i=1}^{n}\left(1+\frac{i-1}{m}\right) = 1+\frac{\lambda}{2}-\frac{1}{2m}, \qquad \lambda = \frac{n}{m}$$
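The closed form for the chaining average follows from the sum of the first n-1 integers. With $\lambda = n/m$, the intermediate steps can be written out as:

```latex
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n}\left(1+\frac{i-1}{m}\right)
  &= 1 + \frac{1}{nm}\sum_{i=1}^{n}(i-1) \\
  &= 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2} \\
  &= 1 + \frac{n-1}{2m} \\
  &= 1 + \frac{\lambda}{2} - \frac{1}{2m}
\end{align*}
```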
Efficiency of Hash Methods
Hash table size = m, Number of elements in hash table = n, Load factor $\lambda = n/m$

              Average Probes for                        Average Probes for
              Successful Search                         Unsuccessful Search
Open Probe    $\frac{1}{2} + \frac{1}{2(1-\lambda)}$      $\frac{1}{2} + \frac{1}{2(1-\lambda)^2}$
Chaining      $1 + \frac{\lambda}{2} - \frac{1}{2m}$      $\lambda$
Summary Slide 1
- Hash Table
  - simulates the fastest searching technique: knowing
    the index of the required value in a vector or array
    and applying that index to access the value. A hash
    function converts the data to an integer for this purpose.
  - The index is obtained by dividing the value from
    the hash function by the table size and taking the
    remainder; the table is then accessed at that index.
    Normally, the number of elements in the table is much
    smaller than the number of distinct data values, so
    collisions occur.
  - To handle collisions, we must place a value that
    collides with an existing table element into the table
    in such a way that we can efficiently access it later.
Summary Slide 2
- Hash Table (Cont…)
  - average running time for a search of a hash table is
    O(1)
  - the worst case is O(n)
Summary Slide 3
- Collision Resolution
  - Two types:
    1) linear open probe addressing
      - the table is a vector or array of static size
      - After using the hash function to compute a
        table index, look up the entry in the table.
      - If the values match, perform an update if
        necessary.
      - If the table entry is empty, insert the value in
        the table.
Summary Slide 4
- Collision Resolution (Cont…)
  - Two types:
    1) linear open probe addressing (continued)
      - Otherwise, probe forward circularly, looking
        for a match or an empty table slot.
      - If the probe returns to the original starting
        point, the table is full.
      - A search may have to examine items that hashed
        to different table locations.
      - Deleting an item is difficult.
Summary Slide 5
- Collision Resolution (Cont…)
    2) chaining with separate lists
      - the hash table is a vector of list objects
      - Each list is a sequence of colliding items.
      - After applying the hash function to compute
        the table index, search the list for the data
        value.
      - If it is found, update its value; otherwise, insert
        the value at the back of the list.
      - A search examines only items that collided at the
        same table location.
Summary Slide 6
- Collision Resolution (Cont…)
    2) chaining with separate lists (continued)
      - there is no limitation on the number of values
        in the table, and deleting an item from the
        table involves only erasing it from its
        corresponding list