Hashing
1. Def. Hash Table - an array in which items are
inserted according to a key value (i.e. the key value is
used to determine the index of the item).
Ex. Student records stored in an array where each
student is assigned an id no. and that number is used
for the index. Are there any problems with this idea?
Gaps will develop if students leave, and insertions of
new students are limited by the original size of the array.
Having to know a student's id no. to find a record is not convenient.
Using the index itself as the key field is not efficient.
2. Def. Hash Function - a function used to
convert numbers from a large range into numbers
in a small range.
(The key field is usually the large range and the
index of the array is usually the small range.)
Ex. Dictionary of 50,000 words. Use the word itself as
the key field, but code it numerically to determine a
unique location to store the word in the array.
Let a = 1, b = 2, c = 3, …z = 26 and let positions
of letters in the word have power of ten values:
Ex. dab = 4 * 10^2 + 1 * 10^1 + 2 * 10^0 = 412
What size array would be needed to store these 50,000
words, if no word is longer than 10 characters?
zzzzzzzzzz would have the code 28,888,888,886!
(Too big: bigger than the largest int, 2,147,483,647, so no array
could be that big.) Also, if locations were chosen this way, there
would be many, many empty cells.
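The letter coding above can be sketched in Java as follows. This is only an illustration: the class name WordCode and the method encode are made up here, and the sketch assumes lowercase words made of the letters a to z. A long is used because the code for a 10-letter word does not fit in an int.

```java
final class WordCode {
    // Hypothetical helper: codes a lowercase word with a=1 ... z=26,
    // giving each letter position a power-of-ten weight.
    static long encode(String word) {
        long code = 0;
        for (int i = 0; i < word.length(); i++) {
            int letter = word.charAt(i) - 'a' + 1; // a=1, b=2, ..., z=26
            code = code * 10 + letter;             // move earlier letters up one power of ten
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(encode("dab"));                            // 412
        System.out.println(encode("zzzzzzzzzz"));                     // 28888888886
        System.out.println(encode("zzzzzzzzzz") > Integer.MAX_VALUE); // true
    }
}
```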
What size array should be used for this dictionary?
100,000 - usually twice as large as the no. of items, to
allow room for collisions (defined below).
A hash function is needed to convert the numeric
code to a smaller range.
Commonly used hash function:
index = largerange % arraysize
Ex. Hash the word gave to find its location in
the array dictionary.
7*10^3 + 1*10^2 + 22*10^1 + 5*10^0 = 7325
Ex. Hash the word gaty to find its location in the
array dictionary.
7*10^3 + 1*10^2 + 20*10^1 + 25*10^0 = 7325
COLLISION!
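A minimal sketch of the modulo hash function applied to the two example words. The names ModHash, encode and hash are hypothetical, and the array size of 100,000 is the one suggested above for the 50,000-word dictionary.

```java
final class ModHash {
    // Same letter coding as in the examples: a=1 ... z=26, powers of ten by position.
    static long encode(String word) {
        long code = 0;
        for (int i = 0; i < word.length(); i++) {
            code = code * 10 + (word.charAt(i) - 'a' + 1);
        }
        return code;
    }

    // index = largerange % arraysize
    static int hash(String word, int arraySize) {
        return (int) (encode(word) % arraySize);
    }

    public static void main(String[] args) {
        int arraySize = 100_000;
        System.out.println(hash("gave", arraySize)); // 7325
        System.out.println(hash("gaty", arraySize)); // 7325, same index: a collision
    }
}
```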
3. Def. Collision - occurs when a key hashes to the index of a cell that is already occupied.
4. There are 2 methods to resolve collisions:
Def. Open addressing - in case of collision, search
for or store in some other available cell.
Def. Separate chaining - install a linked list at
each index of the array and insert all items that
hash to an index into the list.
5. Types of open addressing:
Linear probe method - if collision occurs at
index x, search locations x+1, x+2, etc.
Ex. gaty would be stored in location 7326 (if
available), otherwise location 7327, or 7328, etc.
Note: resolves collisions, but primary clusters occur.
Quadratic probe method - search x+1^2, x+2^2, x+3^2, etc. (i.e. x+1, x+4, x+9, ...).
Note: avoids primary clusters, but secondary clusters
occur.
Rehashing (also called double hashing) - when a
collision occurs, determine the step used to search for
an available cell by hashing the key value again with
a second function.
Ex. Step = 5 - key % 5
What steps result? 5,4,3,2,1
How is this different from the linear &
quadratic probe methods?
The step is different for different keys.
Note: table size must be prime in order to probe all cells.
(Ex. size=20, step=5, x=0: 0,5,10,15,0,5,10,15,…
Try size=19, step=5, x=0:
0,5,10,15,1,6,11,16,2,7,12,17,3,8,13,18,4,9,14)
Write code to increase a hash value by step.
hashval += step
What do we do if a hash value becomes
greater than the size of the array?
Wrap around: hashval %= arraysize
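The three probe strategies, together with the step and wrap-around updates, can be compared by listing the cells each one visits. The sketch below is an illustration only: the class and method names are hypothetical, and only the step function 5 - key % 5 and the example sizes come from the notes.

```java
import java.util.ArrayList;
import java.util.List;

final class ProbeSequences {
    // Linear probe: x+1, x+2, x+3, ... wrapping around the end of the array.
    static List<Integer> linear(int x, int arraySize, int count) {
        List<Integer> cells = new ArrayList<>();
        int hashVal = x;
        for (int i = 0; i < count; i++) {
            hashVal += 1;
            hashVal %= arraySize; // wrap around
            cells.add(hashVal);
        }
        return cells;
    }

    // Quadratic probe: x+1, x+4, x+9, ... (offsets are squares), with wrap-around.
    static List<Integer> quadratic(int x, int arraySize, int count) {
        List<Integer> cells = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            cells.add((int) ((x + (long) i * i) % arraySize));
        }
        return cells;
    }

    // Second hash function from the example: yields a step from 1 to 5.
    static int step(int key) {
        return 5 - key % 5;
    }

    // Double hashing: cells reached from x with a fixed step, until one repeats.
    static List<Integer> cellsReached(int x, int step, int arraySize) {
        List<Integer> cells = new ArrayList<>();
        int hashVal = x % arraySize;
        while (!cells.contains(hashVal)) {
            cells.add(hashVal);
            hashVal += step;
            hashVal %= arraySize; // wrap around
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println(linear(7325, 100_000, 3));      // [7326, 7327, 7328]
        System.out.println(quadratic(7325, 100_000, 3));   // [7326, 7329, 7334]
        System.out.println(cellsReached(0, 5, 20));        // [0, 5, 10, 15]
        System.out.println(cellsReached(0, 5, 19).size()); // 19, every cell is reached
    }
}
```

With size 20 and step 5 the probe only ever sees 4 cells, so an insertion could fail even though the table has room; with the prime size 19 the same step reaches all 19 cells.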
What do we do about duplicate key values?
Should not be allowed. When the first item with a key is found,
the search stops. A second item with the same key would never be
found (unless the code is changed). Select a key value that is
unique to the item (ex. Social Security no.).
How do we handle deletions?
Replace one field (here the key, iData) by -1 rather than replacing the entire
object by null. Often the object's info may be needed in the future. Ex.
Even when an employee leaves, pension & tax info is needed.
However, there is another reason in this code. Something
undesirable occurs if the object is replaced by null.
Demonstrate what and explain why.
What method requires this condition and why?
while (hashRay[hashVal] != null && hashRay[hashVal].iData != -1)
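One way to explore these questions is a small linear-probe table that supports both styles of deletion. This is a sketch under assumptions, not the course's code: only hashRay, hashVal, iData and the while condition come from the notes. The class and method names are hypothetical, keys are assumed to be non-negative, and the table is assumed to always have a free cell.

```java
final class MarkedDeletionTable {
    static final class DataItem {
        int iData;                          // key; -1 marks a deleted item
        DataItem(int key) { iData = key; }
    }

    private final DataItem[] hashRay;

    MarkedDeletionTable(int size) { hashRay = new DataItem[size]; }

    // Insert with a linear probe; a cell is free if it is null OR marked -1.
    void insert(int key) {
        int hashVal = key % hashRay.length;
        while (hashRay[hashVal] != null && hashRay[hashVal].iData != -1) {
            hashVal++;
            hashVal %= hashRay.length;      // wrap around
        }
        hashRay[hashVal] = new DataItem(key);
    }

    // Search stops at the first null cell (or after probing every cell).
    DataItem find(int key) {
        int hashVal = key % hashRay.length;
        for (int probes = 0; probes < hashRay.length && hashRay[hashVal] != null; probes++) {
            if (hashRay[hashVal].iData == key) return hashRay[hashVal];
            hashVal++;
            hashVal %= hashRay.length;
        }
        return null;
    }

    // Deletion as described in the notes: overwrite the key field with -1.
    void deleteByMarking(int key) {
        DataItem item = find(key);
        if (item != null) item.iData = -1;
    }

    // The alternative the notes warn against: replace the object by null.
    void deleteByNulling(int key) {
        int hashVal = key % hashRay.length;
        for (int probes = 0; probes < hashRay.length && hashRay[hashVal] != null; probes++) {
            if (hashRay[hashVal].iData == key) { hashRay[hashVal] = null; return; }
            hashVal++;
            hashVal %= hashRay.length;
        }
    }

    public static void main(String[] args) {
        MarkedDeletionTable nulled = new MarkedDeletionTable(10);
        nulled.insert(15);
        nulled.insert(25);                           // both hash to 5; 25 lands in cell 6
        nulled.deleteByNulling(15);
        System.out.println(nulled.find(25) != null); // false: search stops at the null in cell 5

        MarkedDeletionTable marked = new MarkedDeletionTable(10);
        marked.insert(15);
        marked.insert(25);
        marked.deleteByMarking(15);
        System.out.println(marked.find(25) != null); // true: cell 5 is still occupied, so the probe continues
    }
}
```

In this sketch the insert method is the one that uses the condition shown above, so that a cell marked -1 can be reused by a later insertion.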
6. Def. Load factor - the ratio of the no. of items
in a hash table to the size of the table (array).
The fuller a table is, the worse clustering
becomes. Therefore, hash tables should be
designed never to become more than 1/2 to 2/3
full when open addressing is used.
7. When separate chaining is used to resolve
collisions, is load factor a concern?
No. n items or more can be placed in a table of size n
and the load factor will be 1 or more (i.e. some locations
will hold more than one item in their linked lists).
How do we handle duplicates with separate chaining?
Duplicates are allowed and will be stored in the same
list. Note: search process slows as list is searched
linearly.
How do we handle deletions?
Deletions can be made from a linked list, if
appropriate for the application, without empty cell
problems resulting.
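Separate chaining as described above can be sketched with java.util.LinkedList. The class ChainedTable and its methods are hypothetical names used for illustration, and keys are assumed to be non-negative ints.

```java
import java.util.LinkedList;

final class ChainedTable {
    private final LinkedList<Integer>[] lists; // one linked list per array index
    private int itemCount;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        lists = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            lists[i] = new LinkedList<>();
        }
    }

    // Items that hash to the same index simply share a list; duplicates are allowed.
    void insert(int key) {
        lists[key % lists.length].add(key);
        itemCount++;
    }

    // The list at the hashed index is searched linearly.
    boolean contains(int key) {
        return lists[key % lists.length].contains(key);
    }

    // Removing a list node leaves no empty-cell problem behind.
    boolean delete(int key) {
        boolean removed = lists[key % lists.length].remove(Integer.valueOf(key));
        if (removed) itemCount--;
        return removed;
    }

    double loadFactor() {
        return (double) itemCount / lists.length;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(2);
        table.insert(4);
        table.insert(6);
        table.insert(7);
        table.insert(7);                         // duplicate key, stored in the same list
        System.out.println(table.loadFactor());  // 2.0
        table.delete(4);
        System.out.println(table.contains(6));   // true: 6 is still reachable after deleting 4
    }
}
```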
8. What is the advantage of a hash table?
O(1) complexity on average to search for or
insert an item (i.e. constant time
regardless of the number of
items), provided collisions are few.
9. Disadvantage?
Must know size of array needed in
advance (in Java, arrays cannot be resized;
another, bigger array would be needed).
This problem is reduced when separate
chaining is used.
Also, there is no way to access items in order.