Searching: Binary Trees and Hash Tables


Hash Tables
Gordon College
CS212
1
Hash Tables
• Recall order of magnitude of searches
– Linear search O(n)
– Binary search O(log₂ n)
– Balanced binary tree search O(log₂ n)
– Unbalanced binary tree can degrade to O(n)
2
Hash Tables
• In some situations faster search is needed
– Solution is to use a hash function
– Value of key field given to hash function
– Location in a hash table is calculated
Like an array, but much better: you do not have to set aside space to account for every possible key
3
Hash Functions
mapping from key to index
• Simple function: take the key mod (%) an arbitrary integer
int h(int i)
{
    return i % maxSize;
}
• Note: maxSize is the maximum number of locations in the table
4
Hash Function Access
hf(22) = 22, 22 % 7 = 1 --> tableEntry[1]
hf(4) = 4, 4 % 7 = 4 --> tableEntry[4]
[Figure: table of 7 slots, indices 0 to 6, with arrows to tableEntry[1] and tableEntry[4]]
• Note that we have traded space for speed
– Table must be considerably larger than the number of items anticipated
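The access pattern on this slide can be sketched as follows. This is a minimal illustration, not the course's table class: the EMPTY marker and the buildTable helper are assumptions added here, and collisions are not handled yet.

```cpp
#include <array>

const int maxSize = 7;      // number of slots, as on the slide
const int EMPTY = -1;       // assumed marker for an unused slot (not in the slides)

// Hash function from the slides: map a key to a slot index.
// Assumes non-negative keys; % can yield a negative result otherwise.
int h(int i)
{
    return i % maxSize;
}

// Hypothetical helper: store each key in the slot its hash selects.
std::array<int, maxSize> buildTable()
{
    std::array<int, maxSize> tableEntry;
    tableEntry.fill(EMPTY);
    tableEntry[h(22)] = 22;     // 22 % 7 == 1
    tableEntry[h(4)]  = 4;      // 4 % 7 == 4
    return tableEntry;
}
```

Five of the seven slots stay unused here, which is the space cost the slide refers to.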
5
Hash Function Access
• Example:
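7 digit serial number
Need 10 million records*
Why 10 million records?
Each of the 7 digits can be any of 10 values, and digits may repeat: 10^7 = 10,000,000 possible keys
(The number of r-permutations of a set with n elements, n!/(n-r)! = 10!/(10-7)! = 604,800, counts only the serial numbers with no repeated digit)
* Not practical to have this much space when in reality you are only stocking at most a few thousand records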
6
Hash Function (Mapping)
• Example:
7 digit serial number
Use only 10000 slots
Hashing (Mapping) function - unsigned int Hf(int key)
Hf(1234567) = 1234567 % 10000 = 4567
1234567 / 10000 = 123 (integer division; the fractional part of 123.4567 is dropped)
1234567 - (123 * 10000) = 4567
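The arithmetic above can be checked with a short sketch. The constant NUM_SLOTS and the helper remainderByHand are names introduced here for illustration; only the signature of Hf comes from the slide.

```cpp
const unsigned int NUM_SLOTS = 10000;   // table size used on the slide

// Map a 7-digit serial number onto one of 10000 slots.
unsigned int Hf(int key)
{
    return static_cast<unsigned int>(key) % NUM_SLOTS;
}

// The same remainder computed step by step, as in the slide's arithmetic.
// Assumes a non-negative key.
unsigned int remainderByHand(int key)
{
    int quotient = key / 10000;                              // 1234567 / 10000 == 123
    return static_cast<unsigned int>(key - quotient * 10000); // 1234567 - 1230000 == 4567
}
```

Both functions return 4567 for the key 1234567, so the modulo operator simply keeps the last four digits of the serial number.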
7
Hash Function (Mapping)
• Design Considerations
– Efficient
– Minimize collisions
– Produce uniformly distributed mappings
• (helps minimize collisions)
– Must be able to deal with int, char, string, etc. types for keys
– Must be able to associate a hash function with a container
8
Function Objects
• Can pass a function to a function
• Can use Function Objects
template<typename t>
class functionobject
{
public:
returntype operator() (arguments) const
{
return returnvalue;
}
…….
};
9
Function Objects
Example function class: less than
template<typename T>
class lessThan
{
public:
bool operator() (const T& x, const T& y) const
{
return x < y;
}
};
10
Function Objects
Example function class use
template <typename T, typename Compare>
void insertionSort(vector<T>& v, Compare comp)
{
int i, j, n = v.size();
T temp;
…..
}
Called:
insertionSort(v, lessThan<int>());
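The slide elides the body of insertionSort. One possible way to fill it in is sketched below; the loop is a standard insertion sort written for this illustration and is not necessarily the course's code. It shows the key point: the function never uses < directly, only the comp function object.

```cpp
#include <cstddef>
#include <vector>

// Function object from the slides: wraps operator< for type T.
template <typename T>
class lessThan
{
public:
    bool operator() (const T& x, const T& y) const
    {
        return x < y;
    }
};

// Sketch of an insertion sort that orders v according to comp.
template <typename T, typename Compare>
void insertionSort(std::vector<T>& v, Compare comp)
{
    for (std::size_t i = 1; i < v.size(); i++)
    {
        T temp = v[i];                  // element to place
        std::size_t j = i;
        // Shift elements right while temp belongs before them.
        while (j > 0 && comp(temp, v[j - 1]))
        {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = temp;
    }
}
```

Passing a different function object (for example one that returns x > y) would sort in descending order without changing insertionSort itself.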
11
Function Objects
Example function class use (as seen with the SET container)
template<typename T>
class lessThan
{
public:
bool operator() (const T& x, const T& y) const
{
return x < y;
}
};
set <int, lessThan<int> > A(arr, arr+arrSize);
for(set <int, lessThan<int> >::iterator ii=A.begin();ii!=A.end();ii++)
cout << *ii << " ";
cout << endl;
12
Collisions
Hash Function Access Problem
Collisions are possible:
Depending on the number of slots and the size of the key mapping
13
Collisions
Hash Function Access Problem
• Problem: same value returned by h(i) for
different values of i
– Called collisions
• Simple solution: linear probing
– Linear search begins at collision location
– Continues until empty slot found for insertion
14
Linear Probing
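Table of 11 slots, index = key % 11. Each entry is shown as key(probes), where probes is the number of slots examined to insert the key; "-" marks an empty slot.
Slot   (a)     (b)     (c)     (d)
0      77(1)   77(1)   77(1)   77(1)
1      89(1)   89(1)   89(1)   89(1)
2      -       45(2)   45(2)   45(2)
3      14(1)   14(1)   14(1)   14(1)
4      -       -       35(3)   35(3)
5      -       -       -       76(7)
6      94(1)   94(1)   94(1)   94(1)
7      -       -       -       -
8      -       -       -       -
9      -       -       -       -
10     54(1)   54(1)   54(1)   54(1)
(a) Insert 54, 77, 94, 89, 14
(b) Insert 45
(c) Insert 35
(d) Insert 76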
15
Hash Functions
• Retrieving a value: linear probe until found
– If an empty slot is encountered, then the value is not in the table
• What if deletions are permitted?
– A vacated slot must be marked as deleted rather than left empty, so that it does not end a later linear probe too early
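Retrieval and deletion with marked slots might look like the sketch below. The three-state SlotState enum and the ProbingSet class are illustrative assumptions, not the course's implementation; keys are assumed non-negative, and insert does not check for duplicates.

```cpp
#include <vector>

enum class SlotState { Empty, Occupied, Deleted };

struct Slot
{
    int key = 0;
    SlotState state = SlotState::Empty;
};

class ProbingSet
{
public:
    explicit ProbingSet(int size) : slots(size) {}

    // Place key in the first slot that is not occupied (empty or deleted).
    bool insert(int key)
    {
        int n = size();
        int index = key % n;
        for (int probes = 0; probes < n; probes++)
        {
            if (slots[index].state != SlotState::Occupied)
            {
                slots[index].key = key;
                slots[index].state = SlotState::Occupied;
                return true;
            }
            index = (index + 1) % n;
        }
        return false;                       // table full
    }

    // Return the slot index holding key, or -1 if absent.
    int find(int key) const
    {
        int n = size();
        int index = key % n;
        for (int probes = 0; probes < n; probes++)
        {
            if (slots[index].state == SlotState::Empty)
                return -1;                  // truly empty: key cannot be further on
            if (slots[index].state == SlotState::Occupied && slots[index].key == key)
                return index;
            index = (index + 1) % n;        // deleted or other key: keep probing
        }
        return -1;
    }

    // Mark the slot as deleted instead of empty so later probes continue past it.
    bool erase(int key)
    {
        int index = find(key);
        if (index < 0)
            return false;
        slots[index].state = SlotState::Deleted;
        return true;
    }

private:
    int size() const { return static_cast<int>(slots.size()); }
    std::vector<Slot> slots;
};
```

In an 11-slot table holding 89 (slot 1) and 45 (probed to slot 2), erasing 89 leaves slot 1 marked as deleted, so a search for 45 still probes past it and finds the key in slot 2.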
16
Hash Functions
• Improved performance strategies:
– Increase table capacity (fewer collisions)
– Use different collision resolution technique
– Devise different hash function
• Hash table capacity
– Table size should be about 1.5 to 2 times the number of items to be stored
– Otherwise the probability of collisions is too high
17
Other Collision Strategies
• Linear probing can result in primary clustering
Consider:
• quadratic probing
– Probe sequence from location i is
i + 1, i – 1, i + 4, i – 4, i + 9, i – 9, …
– Secondary clusters can still form
• Double hashing
– Use a second hash function to determine probe
sequence
• hF(key) --> index, hF(index) --> next index
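The two probe sequences can be compared with a short sketch. Both function names are hypothetical. The quadratic version follows the offsets listed on the slide. The double-hashing version uses a common textbook formulation, in which a second function of the key gives a fixed step size; that second function (1 + key % (tableSize - 1)) is an assumption and not taken from the slides. Keys are assumed non-negative and tableSize at least 2.

```cpp
#include <vector>

// First `count` slots visited by quadratic probing from home slot i:
// i + 1, i - 1, i + 4, i - 4, i + 9, i - 9, ... (all modulo tableSize).
std::vector<int> quadraticProbes(int i, int tableSize, int count)
{
    std::vector<int> seq;
    for (int k = 1; static_cast<int>(seq.size()) < count; k++)
    {
        int offset = (k * k) % tableSize;
        seq.push_back((i + offset) % tableSize);
        if (static_cast<int>(seq.size()) < count)
            seq.push_back((i - offset + tableSize) % tableSize);
    }
    return seq;
}

// First `count` slots visited after a collision under double hashing.
// The step comes from an assumed second hash function and is never 0.
std::vector<int> doubleHashProbes(int key, int tableSize, int count)
{
    int index = key % tableSize;                // first hash: home slot
    int step = 1 + key % (tableSize - 1);       // assumed second hash: step size
    std::vector<int> seq;
    for (int k = 0; k < count; k++)
    {
        index = (index + step) % tableSize;
        seq.push_back(index);
    }
    return seq;
}
```

Because the double-hashing step depends on the key, two keys that share a home slot usually follow different probe sequences, which is what limits the secondary clustering seen with quadratic probing.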
18
Collision Strategies
• Chaining
– Table is a list or vector of head nodes to linked lists
– When an item hashes to a location, it is added to that linked list
19
Chaining
< bucket 0 >
< bucket 1 >
< bucket 2 >
....
< bucket n-1 >
Example with 11 buckets, index = key % 11; each entry is shown as key(position in its chain):
< Bucket 0 > 77(1)
< Bucket 1 > 89(1) 45(2)
< Bucket 2 > 35(1)
< Bucket 3 > 14(1)
< Bucket 4 >
< Bucket 5 >
< Bucket 6 > 94(1)
< Bucket 7 >
< Bucket 8 >
< Bucket 9 >
< Bucket 10 > 54(1) 76(2)
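A chained table along these lines can be sketched with standard containers. The class name ChainedTable and the decision to return the chain position from insert are assumptions for this illustration; keys are assumed non-negative.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Hash table with chaining: a vector of buckets, each a linked list of keys.
class ChainedTable
{
public:
    explicit ChainedTable(std::size_t numBuckets) : buckets(numBuckets) {}

    // Append key to its bucket's list and return its position in that chain.
    std::size_t insert(int key)
    {
        std::list<int>& chain = buckets[index(key)];
        chain.push_back(key);
        return chain.size();
    }

    // Search only the one chain the key hashes to.
    bool contains(int key) const
    {
        const std::list<int>& chain = buckets[index(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

private:
    std::size_t index(int key) const
    {
        return static_cast<std::size_t>(key) % buckets.size();
    }

    std::vector<std::list<int> > buckets;
};
```

With 11 buckets and the keys 54, 77, 94, 89, 14, 45, 35, 76 inserted in that order, 45 lands second in bucket 1 and 76 second in bucket 10; no key ever needs more than two comparisons, compared with seven probes for 76 under linear probing.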
20
Improving the Hash Function
• Ideal hash function
– Simple to evaluate (fast)
– Scatters items uniformly throughout table
• Modulo arithmetic not so good for strings
– Possible to manipulate numeric (ASCII) value of
first and last characters of a name
21
Hash Function (basic mapping)
class hFintID
{
public:
unsigned int operator() (int item) const
{
return (unsigned int) item % 10000;
}
};
hFintID hf;
hf(12341234) returns 1234
22
Hash Function (better)
Midsquare technique: mixes up the digits in the serial number
class hFint
{
public:
    unsigned int operator() (int item) const
    {
        unsigned int value = (unsigned int) item;
        value *= value;
        value /= 256;           // discard low order 8 bits
                                // (division performs a shift right)
        return value % 65536;
    }
};
23
String Hash Functions
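GOAL: random distribution
class hFstring
{
public:
    unsigned int operator() (const string & item) const
    {
        const unsigned int prime = 2049982463;
        // unsigned, so overflow wraps around instead of being undefined
        // and no sign test is needed before taking the remainder
        unsigned int n = 0;
        for (size_t i = 0; i < item.length(); i++)
            n = n*8 + (unsigned char) item[i];
        return n % prime;
    }
};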
24
Custom Hash Functions
class hfCode
{
public:
unsigned int operator() (const code & item) const
{
return (unsigned int) item.getNum() % NumofSlots;
}
};
FILE0000.CHK, FILE0001.CHK, FILE0002.CHK
25
Search Algorithms
Sequential Search
- search O(n) (fairly slow)
+ good when data set size is small and does not have to be sorted
Binary Search (sorted vector)
+ search O(log n) [much faster]
+ low cost when it comes to space
- however, requires data be sorted
- not good when the data set is very dynamic (sorting overhead)
Binary Search Tree
+ search O(log n)
+ can scan data in order
- higher cost when it comes to space (various pointers)
Hashing
+ search O(1) on average [fastest]; can degrade toward O(n) when there are many collisions
- higher cost when it comes to space (depends on method)
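For comparison, the four approaches map roughly onto standard C++ library facilities as sketched below. The wrapper function names are hypothetical. std::set is typically implemented as a balanced binary search tree, and std::unordered_set (C++11) as a hash table with chaining, though the standard does not mandate either.

```cpp
#include <algorithm>
#include <set>
#include <unordered_set>
#include <vector>

// Sequential search: O(n), works on unsorted data.
bool sequentialSearch(const std::vector<int>& v, int key)
{
    return std::find(v.begin(), v.end(), key) != v.end();
}

// Binary search: O(log n), requires the vector to be sorted already.
bool binarySearchSorted(const std::vector<int>& sorted, int key)
{
    return std::binary_search(sorted.begin(), sorted.end(), key);
}

// Tree search: O(log n), and the set can be scanned in key order.
bool treeSearch(const std::set<int>& s, int key)
{
    return s.find(key) != s.end();
}

// Hash search: O(1) on average, no ordering among the keys.
bool hashSearch(const std::unordered_set<int>& h, int key)
{
    return h.find(key) != h.end();
}
```

All four return the same answer for the same data; they differ in the preconditions they place on the data and in the time and space they cost.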
26