ICS220 – Data Structures and Algorithms

Data Structures and
Algorithms
Hashing
Dr. Ken Cosh
Review
• Sorting Algorithms
– Elementary
• Insertion Sort
• Selection Sort
• Bubble Sort
– Efficient
• Shell Sort
• Heap Sort
• Quick Sort
• Merge Sort
• Radix Sort
Searching so far
• We have encountered searching algorithms
several times already in the course;
– Linked List searches O(n)
– Pick a Number
• The efficiency of searches has varied,
depending on how effectively the data has been
arranged.
• This week we look at an alternative approach to
searching – where the data could be found in
constant time O(1)
Searching in Constant Time
• In order to find data in constant time, we
need to know where to look for it.
• Given a ‘key’, which could be in any form
(alphanumeric), we need to return an
index for some table (or array).
• A function which converts a key to an
address is known as a hash function.
– If that address turns out to be a unique
address it is a perfect hash function.
Hashing Example
• Take a student id number (IDNum);
– 478603
• A possible hash function could be;
– H(IDNum) = IDNum % 1000
– Which would return – what?
– This number could then be the array index
number.
Hashing
• If only Hashing was that simple…!
• There is a problem with the function: it can
only return 1000 different indexes (0 to 999);
– What happens when there are more than 1000
students?
• When a hash function returns the same index for
more than one key, there is a collision.
• A hash table needs to contain at least as many
positions as the number of elements to be
hashed.
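The ID example can be sketched in a few lines of C++. This is only an illustration, assuming a table of 1000 cells; the name hashId and the second ID number (112603) are made up for the example.

```cpp
// Assumed table size for this sketch.
constexpr int kTableSize = 1000;

// Hypothetical helper: maps a non-negative student ID to a table index
// by keeping its last three digits.
int hashId(int idNum) {
    return idNum % kTableSize;
}

// hashId(478603) gives 603.
// hashId(112603) also gives 603, so these two IDs collide.
```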
Hashing Example 2
• Suppose we need to convert a variable
name into a data location.
– int ken = 31;
• We need a hash function that could return
a unique address for each variable name;
– H(“ken”)
• Consider how many different variable
names there could be?
• How large should the hash table be?
Hashing Example cont.
• Suppose we set the function H() to sum the
values of each letter in the variable name;
– k=11,e=5,n=14;
– H(“ken”) = 30.
• Therefore we could store the ken data in
index 30.
• We can use this bad hashing function to
highlight some problems that hashing
functions should address;
Hashing Problems
• If we have a program with 4 variables;
– name: H(“name”) = 33
– age: H(“age”) = 13
– gender: H(“gender”) = 53
– mean: H(“mean”) = 33
1) The data is spread out throughout the table – with
many unused wasted cells.
2) There is a collision between name and mean.
• These two problems have to be solved by a
simple, efficient algorithm.
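The letter-sum function used in this example could be written roughly as follows. It is a sketch only: it assumes names made up of lowercase letters a to z (a=1, b=2, ..., z=26), and the name letterSumHash is hypothetical.

```cpp
#include <string>

// Letter-sum hash from the example: adds the alphabet position of each letter.
// Assumes the name contains only lowercase letters 'a'..'z'.
int letterSumHash(const std::string& name) {
    int sum = 0;
    for (char c : name) {
        sum += c - 'a' + 1;   // 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
    }
    return sum;
}

// letterSumHash("ken")  == 30
// letterSumHash("name") == 33 and letterSumHash("mean") == 33: a collision,
// because the same letters in a different order give the same sum.
```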
Good Hash Functions
• A good hash function should:
- be easy and quick to compute
- achieve an even distribution of the key values that
actually occur across the index range supported by the
table
• Typically a hash function will take a key value and:
- chop it up into pieces, and
- mix the pieces together in some fashion, and
- compute an index that will be uniformly distributed
across the available range.
• Note: hash functions are NOT random in any sense.
Approaches
• Truncation
– Ignore a part of the key value and use the remainder
as the table index
• e.g., 21296876 maps to 976
• Folding:
– Partition the key, then combine the parts in some
simple manner
• e.g., 21296876 maps to 212 + 968 + 76 = 1256, which is
then reduced mod 1000 to 256
• Modular Arithmetic:
– Convert the key to an integer, and then mod that
integer by the size of the table
• e.g., 21296876 maps to 876 for a table of size 1000
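Folding and modular arithmetic can be sketched in C++ as below. The function names foldHash and modHash are hypothetical, and the choice to split the key into groups of three decimal digits from the left is an assumption taken from the worked example.

```cpp
#include <string>

// Folding: split the decimal digits into groups of three (from the left),
// add the groups together, then reduce modulo the table size.
int foldHash(unsigned long key, int tableSize) {
    const std::string digits = std::to_string(key);
    int sum = 0;
    for (std::size_t pos = 0; pos < digits.size(); pos += 3) {
        sum += std::stoi(digits.substr(pos, 3));   // last group may be shorter
    }
    return sum % tableSize;
}

// Modular arithmetic: reduce the integer key modulo the table size.
// Assumes tableSize is positive.
int modHash(unsigned long key, int tableSize) {
    return static_cast<int>(key % tableSize);
}

// foldHash(21296876, 1000) == 256   (212 + 968 + 76 = 1256)
// modHash(21296876, 1000)  == 876
```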
Truncation Caution
• It is a good idea for the entire key to have some
impact on the hash function. Simply
truncating a key may lead to many keys
returning the same result when hashed.
– Consider truncating the following keys to
their last 3 letters;
• hash, mash, bash, trash (all become “ash”).
Hash Function
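#include <string>

// Sums the character codes of the key, then reduces the sum modulo the table size.
int strHash(const std::string& toHash, const int TableSize) {
    int hashValue = 0;
    for (unsigned int Pos = 0; Pos < toHash.length(); Pos++) {
        hashValue = hashValue + int(toHash.at(Pos));   // add this character's code
    }
    return (hashValue % TableSize);
}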
Given the key ‘ken’ and a table size of 1000, what
would be returned?
Improving the hash function
• The hash function given on the previous slide
returns the same result for both of
the following keys: sham and mash
– The hash function doesn’t take position into account.
• This can easily be remedied with the following
change;
hashValue = 4*hashValue + int(toHash.at(Pos));
• This is known as Collision Reduction, or rather
reducing the chance of collision.
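The effect of the change can be checked with a small side-by-side sketch. The names sumHash and shiftHash are hypothetical, ASCII character codes are assumed, and unsigned arithmetic is used here so that long keys wrap around instead of overflowing a signed int.

```cpp
#include <string>

// Original idea: add the character codes, so letter order is ignored.
unsigned int sumHash(const std::string& key, unsigned int tableSize) {
    unsigned int h = 0;
    for (unsigned char c : key) {
        h += c;
    }
    return h % tableSize;
}

// Modified idea: multiply the running value by 4 before adding each
// character, so the same letters in a different order give a different sum.
unsigned int shiftHash(const std::string& key, unsigned int tableSize) {
    unsigned int h = 0;
    for (unsigned char c : key) {
        h = 4 * h + c;
    }
    return h % tableSize;
}

// With ASCII codes and a table size of 1000:
// sumHash("sham", 1000) == 425 and sumHash("mash", 1000) == 425
// shiftHash("sham", 1000) == 521 but shiftHash("mash", 1000) == 92
```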
Resolving Collisions
• Even with a sophisticated hashing function
it is likely that collisions will still occur, so
we need a strategy to deal with collisions.
• We first find out about a collision if we try
to insert data into a position which is
already filled.
– In this case we can simply insert the data into
a different available position, leaving a record
so the data can be retrieved.
Linear Probing
• Linear probing deals with collisions by
inserting a new value into the next
available space after the space returned
by the hash function;
• If H(key) is occupied, store the data in
H(key)+1; if that cell is also occupied, try
H(key)+2, and so on.
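A minimal sketch of insertion with linear probing is shown below. Several details are assumptions made for the example, not part of the slides: keys are non-negative ints hashed with key % size, an unused cell holds the marker kEmpty, and the search wraps around to the start of the table when it reaches the end.

```cpp
#include <vector>

constexpr int kEmpty = -1;   // assumed marker for an unused cell

// Inserts key into the first free cell at or after its home position.
// Returns the index used, or -1 if the table is full or the key is invalid.
int linearProbeInsert(std::vector<int>& table, int key) {
    const int size = static_cast<int>(table.size());
    if (size == 0 || key < 0) {
        return -1;
    }
    const int home = key % size;              // H(key)
    for (int i = 0; i < size; ++i) {
        const int idx = (home + i) % size;    // H(key), H(key)+1, ... with wrap-around
        if (table[idx] == kEmpty) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;                                // every cell was occupied
}
```

For a table created as std::vector<int> table(5, kEmpty), inserting 7, 12 and 17 (which all hash to cell 2) fills cells 2, 3 and 4 in turn.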
Linear Probing, problem
Consider the following hash table
a b c d
If c is duplicated, the new value is placed in the successive cell.
[Figure: successive states of the table as duplicate values are inserted]
This leads to clustering, which contradicts one of our key objectives.
Quadratic Probing
• Quadratic Probing is designed to combat the clustering
effect of linear probing.
• Rather than inserting the data into the next available cell,
data is inserted into a cell based on the following
sequence;
– k
– k+1
– k+4
– k+9
– k+16
• While this reduces clustering, it introduces a new
problem: the probe sequence may not try every slot in
the table (if the table size is prime, only about half
of the cells will be tested).
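The coverage claim can be explored with a short sketch that counts how many distinct cells the sequence k, k+1, k+4, k+9, ... reaches. The function name quadraticCoverage and its parameters are hypothetical; home stands for k and is assumed to be non-negative.

```cpp
#include <set>

// Counts the distinct cells visited by the probe sequence
// home, home+1, home+4, home+9, ... (taken modulo tableSize)
// over tableSize probes.
int quadraticCoverage(int home, int tableSize) {
    std::set<int> visited;
    for (long long i = 0; i < tableSize; ++i) {
        visited.insert(static_cast<int>((home + i * i) % tableSize));
    }
    return static_cast<int>(visited.size());
}

// quadraticCoverage(0, 11) == 6: only 6 of the 11 cells are ever tried.
// quadraticCoverage(0, 7)  == 4: only 4 of the 7 cells are ever tried.
```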
General Increment Probing
• A general Increment Probing approach will try
each cell in a sequence based on a formula;
– k
– k+s(1)
– k+s(2)
– k+s(3)
– k+s(4)
• Here care must be taken that the formula
doesn’t return to the first cell too quickly.
• What happens if s(i) = i² ?
Key dependent Probing
• Another probing strategy could calculate
the formula based on some part of the
original key – perhaps just adding the
value of the first segment of the key.
• However this could produce inefficient
code.
Deletion
• Given this approach to collision resolution,
care needs to be taken when deleting data
from a cell.
– Why?
• The tombstone method marks a deleted
cell as available for insertion, but also marks it
as having had data in it.
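One way to picture the tombstone method is the sketch below, which uses linear probing with three cell states. All names (CellState, Cell, probeFind, probeInsert, probeRemove) are hypothetical, keys are assumed to be non-negative ints hashed with key % size, and probeInsert assumes the key is not already in the table.

```cpp
#include <vector>

enum class CellState { Empty, Occupied, Deleted };   // Deleted is the tombstone

struct Cell {
    int key = 0;
    CellState state = CellState::Empty;
};

// Returns the index holding key, or -1. The search passes over tombstones
// and stops only at a cell that has never held data.
int probeFind(const std::vector<Cell>& table, int key) {
    const int size = static_cast<int>(table.size());
    if (size == 0 || key < 0) return -1;
    for (int i = 0; i < size; ++i) {
        const int idx = (key % size + i) % size;
        if (table[idx].state == CellState::Empty) return -1;
        if (table[idx].state == CellState::Occupied && table[idx].key == key) return idx;
    }
    return -1;
}

// Stores key in the first cell that is Empty or Deleted (tombstones are reused).
bool probeInsert(std::vector<Cell>& table, int key) {
    const int size = static_cast<int>(table.size());
    if (size == 0 || key < 0) return false;
    for (int i = 0; i < size; ++i) {
        const int idx = (key % size + i) % size;
        if (table[idx].state != CellState::Occupied) {
            table[idx].key = key;
            table[idx].state = CellState::Occupied;
            return true;
        }
    }
    return false;   // table is full
}

// Marks the cell as a tombstone instead of resetting it to Empty.
bool probeRemove(std::vector<Cell>& table, int key) {
    const int idx = probeFind(table, key);
    if (idx < 0) return false;
    table[idx].state = CellState::Deleted;
    return true;
}
```

In a 5-cell table holding 2, 7 and 12 (all hashing to cell 2), removing 7 leaves a tombstone in cell 3. A search for 12 still reaches cell 4 because it does not stop at the tombstone; if cell 3 had simply been reset to empty, the search would have stopped there and 12 would appear to be missing.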
Alternative to Probing
• An alternative strategy to probing is to use
Separate Chaining.
– Here more than one piece of data can be
associated with the same cell in the hash table
– The cell can contain a pointer to a linked list
of data insertions.
– This is sometimes known as a bucket.
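Separate chaining might look roughly like the sketch below, where each cell holds a linked list of keys. The class ChainedTable and its members are hypothetical, the hash is the simple character-code sum from earlier (ASCII assumed), and the table is assumed to be created with at least one bucket.

```cpp
#include <list>
#include <string>
#include <vector>

// Hash table of strings in which every cell is a linked list (a bucket).
class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucketCount) : buckets_(bucketCount) {}

    // Colliding keys are simply appended to the same bucket.
    void insert(const std::string& key) {
        buckets_[index(key)].push_back(key);
    }

    // Only the one bucket the key hashes to needs to be searched.
    bool contains(const std::string& key) const {
        for (const std::string& stored : buckets_[index(key)]) {
            if (stored == key) return true;
        }
        return false;
    }

    // Number of keys stored in the bucket that key hashes to.
    std::size_t bucketLength(const std::string& key) const {
        return buckets_[index(key)].size();
    }

private:
    // Character-code sum, reduced modulo the number of buckets.
    std::size_t index(const std::string& key) const {
        std::size_t sum = 0;
        for (unsigned char c : key) sum += c;
        return sum % buckets_.size();
    }

    std::vector<std::list<std::string>> buckets_;
};
```

With 10 buckets, inserting "sham" and "mash" places both keys in the same bucket, and both can still be found, since a collision only makes that bucket's list longer.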