Hash Tables (open hashing)


CS 261 – Data Structures
Hash Tables
Part 1. Open Address Hashing
Can we do better than O(log n) ?
• We have seen how skip lists and AVL trees can reduce
the time to perform operations from O(n) to O(log n)
• Can we do better? Can we find a structure that will
provide O(1) operations?
• Yes. No. Well, Maybe….
Hash Tables
• Hash tables are similar to Arrays except…
– Elements can be indexed by values other than integers
– A single position may hold more than one element
• Arbitrary values (hash keys) map to integers by means of a
hash function
• Computing a hash function is usually a two-step process:
1. Transform the value (or key) to an integer
2. Map that integer to a valid hash table index
• Example: storing names
– Compute an integer from a name
– Map the integer to an index in a table (i.e., a vector, array, etc.)
Hash Tables
Say we’re storing names:
Angie, Joe, Abigail, Linda, Mark, Max, Robert, John
A hash function maps each name to a table index:
  0: Angie, Robert
  1: Linda
  2: Joe, Max, John
  3: (empty)
  4: Abigail, Mark
Hash Function: Transforming to an Integer
• Mapping: Map (a part of) the key into an integer
– Example: a letter to its position in the alphabet
• Folding: key partitioned into parts which are then combined using
efficient operations (such as add, multiply, shift, XOR, etc.)
– Example: summing the values of each character in a string
• Shifting: get rid of high- or low-order bits that are not random
– Example: if keys are always even, shift off the low order bit
• Casts: converting a numeric type into an integer
– Example: casting a character to an int to get its ASCII value
Hash Function: Combinations
• Another use for shifting: in combination with folding when the
fold operator is commutative:
  Key   Mapped chars   Folded   Shifted and Folded
  eat   5 + 1 + 20     26       20 + 2 + 20 = 42
  ate   1 + 20 + 5     26       4 + 40 + 5  = 49
  tea   20 + 5 + 1     26       80 + 10 + 1 = 91
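The table above can be reproduced with a short Python sketch (the function names are our own; the slides don't prescribe an implementation). Plain folding just sums the 1-indexed letter positions, so anagrams collide; shifting the running total left one bit before each add makes order matter:

```python
def letter_value(c):
    # 1-indexed position in the alphabet: a=1, b=2, ..., z=26 (the table's mapping)
    return ord(c.lower()) - ord('a') + 1

def fold_hash(s):
    # plain folding: commutative, so all anagrams get the same value
    return sum(letter_value(c) for c in s)

def shift_fold_hash(s):
    # shift the accumulated total left one bit before each add, so order matters
    h = 0
    for c in s:
        h = (h << 1) + letter_value(c)
    return h
```

With these definitions, fold_hash gives 26 for all three anagrams, while shift_fold_hash gives 42 for "eat", 49 for "ate", and 91 for "tea", matching the table.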
Hash Function: Mapping to a Valid Index
• Almost always use modulus operator (%) with table size:
– Example: idx = hash(val) % data.size()
• Must be sure that the final result is positive:
– Use only positive arithmetic or take the absolute value
– Beware the smallest (most negative) number, whose absolute value
overflows; possibly use longs
• To get a good distribution of indices, prime numbers
make the best table sizes:
– Example: if you have 1000 elements, a table size of 997 or
1009 is preferable
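In Python, % already returns a non-negative result for a positive divisor, but in languages like C and Java a negative hash value produces a negative remainder. One possible sketch of the mapping step (the helper name is made up) uses a bit mask, which sidesteps the overflow that taking the absolute value of the smallest negative number would cause:

```python
def table_index(h, m):
    # Mask to 31 bits before taking the modulus: unlike abs(), masking
    # cannot overflow on the smallest negative fixed-width integer.
    return (h & 0x7FFFFFFF) % m
```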
Hash Functions: some ideas
• Here are some typical hash functions:
– Character: the char value cast to an int (its ASCII value)
– Date: a value associated with the current time
– Double: a value generated by its bitwise representation
– Integer: the int value itself
– String: a folded sum of the character values
– URL: the hash code of the host name
Hash Tables: Collisions
• Ideally, we want a perfect hash function where each data
element hashes to a unique hash index
• However, unless the data is known in advance, this is
usually not possible
• A collision is when two or more different keys result in
the same hash table index
Example, perfect hashing
• Alfred, Alessia, Amina, Amy, Andy and Anne have a
club. Amy needs to store information in a six-element
array. Amy discovers she can convert the 3rd letter
(its 0-indexed position in the alphabet) to an index:
  Alfred   F = 5 % 6  = 5
  Alessia  E = 4 % 6  = 4
  Amina    I = 8 % 6  = 2
  Amy      Y = 24 % 6 = 0
  Andy     D = 3 % 6  = 3
  Anne     N = 13 % 6 = 1
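Amy's scheme can be sketched in a couple of lines of Python (the function name is hypothetical; the slide only specifies the rule: 0-indexed alphabet position of the third letter, mod the table size):

```python
def amy_index(name, table_size=6):
    # 0-indexed alphabet position of the 3rd letter, wrapped to the table size
    third = name[2].lower()
    return (ord(third) - ord('a')) % table_size
```

For example, amy_index("Alfred") is 5 and amy_index("Amy") is 0, matching the table.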
Indexing is faster than searching
• Can convert a name (e.g. Alessia) into a number (e.g. 4)
in constant time.
• Even faster than searching.
• Allows for O(1) time operations.
• Of course, things get more complicated when the input
values change (Alan wants to join the club, and his 'a' = 0
is the same index as Amy's; or worse yet Al, who doesn't
have a third letter!)
Hash Tables: Resolving Collisions
There are several general approaches to resolving
collisions:
1. Open address hashing: if a spot is full, probe for next empty spot
2. Chaining (or buckets): keep a Collection at each table entry
3. Caching: save the most recently accessed value; fall back to a slow search otherwise
Today we will examine Open Address Hashing
Open Address Hashing
• All values are stored in an array.
• Hash value is used to find initial index to try.
• If that position is filled, the next position is examined, then the
next, and so on until an empty position is found
• The process of looking for an empty position is termed
probing, specifically linear probing.
• There are other probing algorithms, but we won’t
consider them.
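The add and contains operations just described might be sketched like this in Python (a toy, not the course's required interface; it reuses Amy's third-letter hash as a stand-in for hashfun, and add would loop forever on a completely full table, which is what reorganization later prevents):

```python
class OpenAddressTable:
    def __init__(self, capacity=8):
        self.data = [None] * capacity

    def _hash(self, name):
        # stand-in hash: 0-indexed alphabet position of the 3rd letter
        return (ord(name[2].lower()) - ord('a')) % len(self.data)

    def add(self, name):
        i = self._hash(name)
        while self.data[i] is not None:
            i = (i + 1) % len(self.data)   # linear probing, wrapping at the end
        self.data[i] = name                # note: assumes at least one empty slot

    def contains(self, name):
        i = self._hash(name)
        while self.data[i] is not None:    # an empty slot ends the search
            if self.data[i] == name:
                return True
            i = (i + 1) % len(self.data)
        return False
```

Adding the club members in the slides' order reproduces the slides' layout: Anne lands in slot 6 and Agnes wraps around to slot 1.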
Example
• Eight-element table using Amy's hash function:
  0 (a,i,q,y): Amina
  1 (b,j,r,z): (empty)
  2 (c,k,s):   (empty)
  3 (d,l,t):   Andy
  4 (e,m,u):   Alessia
  5 (f,n,v):   Alfred
  6 (g,o,w):   (empty)
  7 (h,p,x):   Aspen
Now Suppose Anne wants to Join
• Anne's index position (5) is filled by Alfred, so we probe to
find the next free location, slot 6.
  0 (a,i,q,y): Amina
  1 (b,j,r,z): (empty)
  2 (c,k,s):   (empty)
  3 (d,l,t):   Andy
  4 (e,m,u):   Alessia
  5 (f,n,v):   Alfred
  6 (g,o,w):   Anne
  7 (h,p,x):   Aspen
Next comes Agnes
• Agnes' home position (5) is filled by Alfred, so we once more
probe: 6 holds Anne and 7 holds Aspen. When we get to the end of
the array, we start again at the beginning, eventually finding
position 1.
  0 (a,i,q,y): Amina
  1 (b,j,r,z): Agnes
  2 (c,k,s):   (empty)
  3 (d,l,t):   Andy
  4 (e,m,u):   Alessia
  5 (f,n,v):   Alfred
  6 (g,o,w):   Anne
  7 (h,p,x):   Aspen
Finally comes Alan
• Lastly Alan wants to join. His location (0) is filled by
Amina; probing finds the last free location, slot 2. The collection
is now completely filled. (More on this later.)
  0 (a,i,q,y): Amina
  1 (b,j,r,z): Agnes
  2 (c,k,s):   Alan
  3 (d,l,t):   Andy
  4 (e,m,u):   Alessia
  5 (f,n,v):   Alfred
  6 (g,o,w):   Anne
  7 (h,p,x):   Aspen
Next operation, contains test
• Hash to find the initial index, then move forward examining each
location until the value is found, or an empty location is reached.
• Search for Amina, search for Anne, search for Albert.
• Notice that search time is not uniform.
  0 (a,i,q,y): Amina
  1 (b,j,r,z): (empty)
  2 (c,k,s):   (empty)
  3 (d,l,t):   Andy
  4 (e,m,u):   Alessia
  5 (f,n,v):   Alfred
  6 (g,o,w):   (empty)
  7 (h,p,x):   Aspen
Final Operation: Remove
• Remove is tricky: we can't just replace the entry with null. What
happens if we delete Agnes, then search for Alan?
  0 (a,i,q,y): Amina
  1 (b,j,r,z): (empty, Agnes deleted)
  2 (c,k,s):   Alan
  3 (d,l,t):   Andy
  4 (e,m,u):   Alessia
  5 (f,n,v):   Alfred
  6 (g,o,w):   Anne
  7 (h,p,x):   Aspen
How to handle remove
• Simple solution: just don't do it. (We will do this one.)
• Better: create a tombstone:
– A value that marks a deleted entry
– Cannot be replaced with a new entry
– But doesn't halt a search
  0 (a,i,q,y): Amina
  1 (b,j,r,z): TOMBSTONE
  2 (c,k,s):   Alan
  3 (d,l,t):   Andy
  4 (e,m,u):   Alessia
  5 (f,n,v):   Alfred
  6 (g,o,w):   Anne
  7 (h,p,x):   Aspen
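A tombstone version of remove might be sketched as follows (the TOMBSTONE sentinel and function shapes are our own; the slides don't fix a representation). The key point is that a search skips over tombstones instead of stopping at them:

```python
TOMBSTONE = "<<tombstone>>"   # hypothetical sentinel marking a deleted slot

def remove(table, start, value):
    # start is the value's hash index; an empty (None) slot ends the probe
    i = start
    while table[i] is not None:
        if table[i] == value:
            table[i] = TOMBSTONE   # mark deleted, but keep later entries reachable
            return True
        i = (i + 1) % len(table)
    return False

def contains(table, start, value):
    i = start
    while table[i] is not None:    # a tombstone is not None, so probing continues
        if table[i] == value:
            return True
        i = (i + 1) % len(table)
    return False
```

After deleting Agnes from slot 1, a search for Alan (who hashed to 0 and was pushed to slot 2) still succeeds, because the tombstone keeps the probe sequence alive.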
Hash Table Size - Load Factor
• Load factor: λ = n / m, where n is the number of elements and
m is the size of the table
– So, load factor represents the average number of elements at
each table entry
– For open address hashing, load factor is between 0 and 1 (often
somewhere between 0.5 and 0.75)
– For chaining, load factor can be greater than 1
• Want the load factor to remain small
What to do with a large load factor
• Common solution: When the load factor becomes too
large (say, bigger than 0.75) then reorganize.
• Create a new table with twice the number of positions
• Copy each element, rehashing using the new table size,
placing elements in new table
• Then delete the old table
• Exactly like you did with the dynamic array, only this
time using hashing.
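The steps above can be sketched as one possible reorganize helper (hashfun is assumed, as in the assignment; each element must be rehashed against the new, doubled size, since its index depends on the table size):

```python
def reorganize(data, hashfun):
    # create a new table with twice the number of positions
    new_data = [None] * (2 * len(data))
    for value in data:
        if value is None:
            continue
        i = hashfun(value) % len(new_data)   # rehash using the NEW table size
        while new_data[i] is not None:
            i = (i + 1) % len(new_data)      # linear probing as usual
        new_data[i] = value
    return new_data   # the old table can now be discarded
```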
Hash Tables: Algorithmic Complexity
• Assumptions:
– Time to compute the hash function is constant
– Worst case analysis: all values hash to the same position
– Best case analysis: the hash function uniformly distributes the
values (all buckets have the same number of objects in them)
• Find element operation:
– Worst case for open addressing: O(n)
– Best case for open addressing: O(1)
Hash Tables: Average Case
• What about the average case?
• It turns out to be 1/(1-λ)
• So keeping the load factor small is very important
  λ      1/(1-λ)
  0.25   1.3
  0.50   2.0
  0.60   2.5
  0.75   4.0
  0.85   6.7
  0.95   20.0
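The values in the table can be reproduced with a one-liner (this just evaluates the slides' 1/(1-λ) expression; more detailed probing analyses give different constants):

```python
def expected_probes(load_factor):
    # average number of probes under the slides' 1/(1-lambda) model
    return 1.0 / (1.0 - load_factor)
```

Notice how sharply the cost grows as λ approaches 1, which is why reorganization is triggered well before the table fills.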
Your turn
• Complete the implementation of the hash table
• Use hashfun(value) to get hash value
• Don’t do remove.
• Do add and contains test first, then do the internal
reorganize method