EEM 480 - Anadolu University


EEM 480
Lecture 11
Hashing and Dictionaries
Symbol Table

Symbol tables are used by compilers to keep track of information about:
• variables
• functions
• class names
• type names
• temporary variables
• etc.

Typical symbol table operations are Insert, Delete, and Search.

It's a dictionary structure!
Symbol Table

What kind of information is usually stored in a symbol table?
• type (int, short, long int, float, …)
• storage class (label, static symbol, external def, structure tag, …)
• size
• scope
• stack frame offset
• register

We also need a way to keep track of reserved words.
Symbol Table

Where is a symbol table stored?
• array/linked list
  • simple, but linear lookup time
  • however, we may use a sorted array for reserved words, since they are generally few and known in advance
• balanced tree
  • O(log n) lookup time
• hash table
  • most common implementation
  • O(1) amortized time for dictionary operations
Hashing

• Hashing is a technique for performing insertions, deletions, and searches in constant average time.
• It depends on mapping keys into positions in a table called a hash table.
Hashing

(Example figure: names hashed into a table.)
• In this example, John maps to 3, Phil maps to 4, …
• Problems:
  • How will the mapping be done?
  • What happens if two items map to the same place?
A Plan For Hashing

• Save items in a key-indexed table. The index is a function of the key.
• Hash function: method for computing the table index from the key.
• Collision resolution strategy: algorithm and data structure to handle two keys that hash to the same index.
• Space-time tradeoff:
  • If there is no space limitation: trivial hash function with the key as the address.
  • If there is no time limitation: trivial collision resolution = sequential search.
  • Limitations on both time and space: hashing (the real world).
Hashing

Hash tables:
• use an array of size m to store elements
• given key k (the identifier name), use a function h to compute the index h(k) for that key
• collisions are possible: two keys hash into the same slot

Hash functions:
• are easy to compute
• avoid collisions (by breaking up patterns in the keys and uniformly distributing the hash values)
Hashing

Nomenclature:
• k is a key
• h(k) is the hash function
• m is the size of the hash table
• n is the number of keys in the hash table

What is Hash

(from Wikipedia) Hash is an American dish consisting of a mixture of beef (often corned beef or roast beef), onions, potatoes, and spices that are mashed together into a coarse, chunky paste, and then cooked, either alone or with other ingredients.

Is it related to our definition? Yes: just as the dish is chopped and mixed, a hash function chops up any patterns in the keys so that the results are uniformly distributed.
What is Hashing

Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.
Hashing

When the key is a string, we generally use the ASCII values of its characters in some way. Examples for k = c1c2c3...cx:
• h(k) = (c1*128^(x-1) + c2*128^(x-2) + ... + cx*128^0) mod m
• h(k) = (c1 + c2 + ... + cx) mod m
• h(k) = (h1(c1) + h2(c2) + ... + hx(cx)) mod m, where each hi is an independent hash function.
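
A minimal Python sketch of the first (polynomial) string hash above; the function name and the example key and table size are illustrative, not from the slides:

    def poly_hash(key, m):
        # h(k) = (c1*128^(x-1) + ... + cx*128^0) mod m, computed via Horner's rule
        h = 0
        for ch in key:
            h = (h * 128 + ord(ch)) % m   # keep intermediate values small
        return h

    poly_hash("temp", 101)   # some index in 0..100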

Finding A Hash Function

Goal: scramble the keys so that each table position is equally likely for each key.

• Ex: Vatandaşlık Numarası (Turkish national ID number) for 10000 people
  • Bad: the whole number, since all possible values will not be used
  • Better: the last three digits; but note the last digit is always even
  • Best: use digits 2, 3, 4, 5
• Ex: date of birth
  • Bad: first three digits of the birth year
  • Better: birthday
• Ex: phone numbers
  • Bad: first three digits
  • Better: last three digits
Hash Function
Truncation

• Ignore part of the key and use the remaining part directly as the index.
• Example: if the keys are 8-digit numbers and the hash table has 1000 entries, then the first, fourth, and eighth digits could make the hash function.
• Not a very good method: it does not distribute keys uniformly.
Hash Function
Folding

• Break up the key into parts and combine them in some way.
• Example: if the keys are 9-digit numbers, break a key into three 3-digit numbers and add them up.
• Ex: ISBN 0-321-37319-7
  • Divide it into three parts: 321, 373, and 197
  • Add them: 891; used mod 500, this gives 391
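
A tiny Python sketch of folding on the ISBN example above (the 3-digit grouping follows the slide; the function name is illustrative):

    def fold_hash(digits, m):
        # break the digit string into 3-digit groups and add them up
        parts = [int(digits[i:i+3]) for i in range(0, len(digits), 3)]
        return sum(parts) % m

    fold_hash("321373197", 500)   # (321 + 373 + 197) mod 500 = 891 mod 500 = 391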

Hash Function
Middle square

• Compute k*k and pick some digits from the resulting number.
• Example: given a 9-digit key k and a hash table of size 1000, pick three digits from the middle of the number k*k.
• Ex: 175344387: 344*344 = 118336, giving 183 or 833.
• Works fairly well in practice if the keys do not have many leading or trailing zeroes.
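
A sketch of the example's variant of middle square (it squares the middle three digits of the key, as the slide does; names and digit positions are illustrative):

    def middle_square(k, m=1000):
        mid = int(str(k)[3:6])        # middle three digits of a 9-digit key, e.g. 344
        sq = mid * mid                # 344*344 = 118336
        s = str(sq)
        start = (len(s) - 3) // 2
        return int(s[start:start+3])  # three digits from the middle: here 183

    middle_square(175344387)          # 183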

Hash Function
Division

• h(k) = k mod m
• Fast.
• Not all values of m are suitable. For example, powers of 2 should be avoided, because then k mod m is just the least significant bits of k.
• Good values for m are prime numbers.

Hash Function
Multiplication

• h(k) = floor(m * (k*c - floor(k*c))), with 0 < c < 1
• In English:
  • Multiply the key k by a constant c, 0 < c < 1
  • Take the fractional part of k*c
  • Multiply that by m
  • Take the floor of the result
• The value of m does not make a difference.
• Some values of c work better than others.
• A good value for c: (sqrt(5) - 1)/2 ≈ 0.618.
Hash Function

Multiplication

Example: suppose the size of the table, m, is 1301.
For k=1234, h(k)=850
For k=1235, h(k)=353
For k=1236, h(k)=1157
For k=1237, h(k)=660
For k=1238, h(k)=164
For k=1239, h(k)=968
For k=1240, h(k)=471

Note how consecutive keys are scattered across the table.
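
A minimal Python sketch of the multiplication method, assuming c = (sqrt(5)-1)/2 as suggested above; it reproduces the table values:

    import math

    C = (math.sqrt(5) - 1) / 2            # ≈ 0.6180339887

    def mult_hash(k, m):
        frac = k * C - math.floor(k * C)  # fractional part of k*c
        return math.floor(m * frac)

    for k in range(1234, 1241):
        print(k, mult_hash(k, 1301))      # 850, 353, 1157, 660, 164, 968, 471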
Hash Function

Universal Hashing

• Worst-case scenario: the chosen keys all hash to the same slot.
• This can be avoided if the hash function is not fixed:
  • Start with a collection of hash functions with the property that, for any given set of inputs, they scatter the inputs well among the range of the function.
  • Select one at random and use that.
• Good performance on average: the probability that the randomly chosen hash function exhibits the worst-case behavior is very low.
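
A sketch of one standard universal family (the Carter-Wegman construction, not spelled out on the slides): pick a prime p larger than any key, then draw a and b at random.

    import random

    P = 2_147_483_647                 # a prime larger than any key (assumption)

    def make_universal_hash(m):
        a = random.randrange(1, P)    # chosen once, at random
        b = random.randrange(0, P)
        return lambda k: ((a * k + b) % P) % m

    h = make_universal_hash(1301)
    h(1234)                           # some slot; differs from run to run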
When Collision Occurs...

A collision occurs when more than one item is mapped to the same location.

• Ex: n = 10, m = 10, use mod 10
  • 9 will be mapped to 9
  • 769 will be mapped to 9

In probability theory, the birthday problem or birthday paradox pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of 23 (or more) randomly chosen people, there is more than a 50% probability that some pair of them will have been born on the same day. For 57 or more people, the probability is more than 99%, reaching 100% when the number of people reaches 367. The mathematics behind this problem leads to a well-known cryptographic attack called the birthday attack.

When a collision occurs, an algorithm has to map the second, third, ..., nth item to a definite place in the table, and the same algorithm must be used to retrieve the data from the table.
Resolving Collision
Chaining

• Put all the elements that collide in a chain (list) attached to the slot.
• The hash table is an array of linked lists.
• The load factor indicates the average number of elements stored in a chain. It could be less than, equal to, or larger than 1.

What is Load Factor?

• Given a hash table of size m, and n elements stored in it, we define the load factor of the table as λ = n/m (lambda).
• The load factor gives us an indication of how full the table is.
• The possible values of the load factor depend on the method we use for resolving collisions.
Return to Resolving Collision: Chaining (ctd.)

Chaining puts elements that hash to the same slot in a linked list.
• Separate chaining: array of M linked lists.
• Hash: map key to integer i between 0 and M-1.
• Insert: put at front of the ith chain (constant time).
• Search: only need to search the ith chain (proportional to the length of the chain).
Chaining

• Insert/Delete/Lookup in expected O(1) time.
• Keep the list doubly-linked to facilitate deletions.
• Worst-case lookup time is linear. However, this assumes that the chains are kept small.
• If the chains start becoming too long, the table must be enlarged and all the keys rehashed.
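
A minimal sketch of separate chaining in Python; plain lists stand in for linked lists, and all names are illustrative:

    class ChainedHashTable:
        def __init__(self, m=11):
            self.m = m
            self.slots = [[] for _ in range(m)]   # array of chains

        def _h(self, key):
            return hash(key) % self.m

        def insert(self, key, value):
            chain = self.slots[self._h(key)]
            for i, (k, _) in enumerate(chain):
                if k == key:                      # update an existing key
                    chain[i] = (key, value)
                    return
            chain.insert(0, (key, value))         # insert at front: O(1)

        def search(self, key):
            for k, v in self.slots[self._h(key)]:  # search only the ith chain
                if k == key:
                    return v
            return None                           # not found

        def delete(self, key):
            i = self._h(key)
            self.slots[i] = [(k, v) for (k, v) in self.slots[i] if k != key]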

Chaining Performance

• Search cost is proportional to the length of the chain.
  • Trivial: average length = N / M.
  • Worst case: all keys hash to the same chain.
• Theorem: let λ = N / M be the average length of a list, called the load factor.
  • Average successful search cost: 1 + λ/2.
• What is the choice of M?
  • M too large: too many empty chains.
  • M too small: chains too long.
  • Typical choice: λ = N / M ≈ 10 gives constant-time search/insert.
Chaining Performance

Analysis of successful search:
• The expected number of elements examined during a successful search for key k is one more than the expected number of elements examined when k was inserted.
• It makes no difference whether we insert at the beginning or the end of the list.
• Take the average, over the n items in the table, of 1 plus the expected length of the chain to which the ith element was added.

Open Addressing

Open addressing:
• Store all elements within the table.
• The space we save from the chain pointers is used instead to make the array larger.
• If there is a collision, probe the table in a systematic way to find an empty slot.
• If the table fills up, we need to enlarge it and rehash all the keys.
Open Addressing

• Hash function: (h(k) + i) mod m for i = 0, 1, ..., m-1.
• Insert: start at the location where the key hashed and do a sequential search for an empty slot.
• Search: start at the location where the key hashed and do a sequential search until you either find the key (success) or find an empty slot (failure).
• Delete: (lazy deletion) follow the same route, but mark the slot as DELETED rather than EMPTY; otherwise subsequent searches may fail.
Hash Table without Linked-List

• Linear probing: array of size M.
• Hash: map key to integer i between 0 and M-1.
• Insert: put in slot i if free; if not, try i+1, i+2, etc.
• Search: search slot i; if occupied but no match, try i+1, i+2, etc.
• Cluster: a contiguous block of items. Search through a cluster using the elementary algorithm for arrays.
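
A minimal sketch of linear probing with lazy deletion, assuming the table never completely fills; names and sentinels are illustrative:

    EMPTY, DELETED = object(), object()    # sentinel markers

    class LinearProbingTable:
        def __init__(self, m=13):
            self.m = m
            self.keys = [EMPTY] * m

        def insert(self, k):
            i = k % self.m                 # h(k)
            while self.keys[i] not in (EMPTY, DELETED):
                i = (i + 1) % self.m       # probe the next slot
            self.keys[i] = k

        def search(self, k):
            i = k % self.m
            while self.keys[i] is not EMPTY:    # DELETED slots are skipped
                if self.keys[i] == k:
                    return i
                i = (i + 1) % self.m
            return None                    # hit an EMPTY slot: failure

        def delete(self, k):
            i = self.search(k)
            if i is not None:
                self.keys[i] = DELETED     # lazy deletion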
Open Addressing: Linear Probing

• Advantage: very easy to implement.
• Disadvantage: primary clustering.
  • Long sequences of used slots build up, with gaps between them. Every insertion requires several probes and adds to the cluster.
  • The average length of a probe sequence when inserting is (1/2)(1 + 1/(1-λ)^2).
Quadratic Probes

• Probe the table at slots (h(k) + i^2) mod m for i = 0, 1, 2, 3, ..., m-1.
• Ease of computation:
  • Not as easy as linear probing.
  • Do we really have to compute a power?
• Clustering:
  • Primary clustering is avoided, since the probes are not sequential.
Quadratic Probing: Search

Probe sequence for hash value 3 in a table of size 16 (all arithmetic mod 16):
3 + 0^2 = 3
3 + 1^2 = 4
3 + 2^2 = 7
3 + 3^2 = 12
3 + 4^2 = 3
3 + 5^2 = 12
3 + 6^2 = 7
3 + 7^2 = 4
3 + 8^2 = 3
3 + 9^2 = 4
3 + 10^2 = 7
3 + 11^2 = 12
3 + 12^2 = 3
3 + 13^2 = 12
3 + 14^2 = 7
3 + 15^2 = 4

Only the four slots 3, 4, 7, and 12 are ever probed.
Quadratic Probing

Probe sequence for hash value 3 in a table of size 19 (all arithmetic mod 19):
3 + 0^2 = 3
3 + 1^2 = 4
3 + 2^2 = 7
3 + 3^2 = 12
3 + 4^2 = 0
3 + 5^2 = 9
3 + 6^2 = 1
3 + 7^2 = 14
3 + 8^2 = 10
3 + 9^2 = 8

Note that, unlike the size-16 table, all ten probes land in distinct slots.
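
A small Python sketch that reproduces both probe sequences above:

    def quadratic_probes(h, m, count):
        return [(h + i * i) % m for i in range(count)]

    quadratic_probes(3, 16, 16)  # [3, 4, 7, 12, 3, 12, 7, 4, ...] only 4 distinct slots
    quadratic_probes(3, 19, 10)  # [3, 4, 7, 12, 0, 9, 1, 14, 10, 8] all distinct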
Quadratic Probing

• Disadvantage: secondary clustering.
  • If h(k1) == h(k2), the probing sequences for k1 and k2 are exactly the same.
  • Is this really bad? In practice, not so much.
  • It becomes an issue when the load factor is high.

Double Hashing

• The hash function is (h(k) + i*h2(k)) mod m.
• In English: use a second hash function to obtain the next slot.
• The probing sequence is:
  h(k), h(k)+h2(k), h(k)+2*h2(k), h(k)+3*h2(k), ...
• Performance:
  • Much better than linear or quadratic probing.
  • Does not suffer from clustering.
  • BUT requires computation of a second function.
Double Hashing

• The choice of h2(k) is important.
  • It must never evaluate to zero: consider h2(k) = k mod 9 for k = 81.
• The choice of m is important.
  • If it is not prime, we may run out of alternate locations very fast.
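
A minimal sketch of the double-hashing probe sequence, using one common choice h2(k) = R - (k mod R) for a prime R < m (an assumption, not from the slides) so that h2 is never zero:

    R = 7        # a prime smaller than the table size m

    def h2(k):
        return R - (k % R)        # always in 1..R, never 0

    def probe_sequence(k, m, count):
        return [(k % m + i * h2(k)) % m for i in range(count)]

    probe_sequence(81, 13, 5)     # [3, 6, 9, 12, 2]: distinct slots even though 81 % 9 == 0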

Rehashing

• After the table is 70% full, double the size of the hash table.
• Don't forget to make the new size a prime number.
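
A sketch of this resize rule; the 70% threshold is from the slide, while next_prime and the reinsertion step are illustrative assumptions:

    def next_prime(n):
        def is_prime(x):
            return x > 1 and all(x % d for d in range(2, int(x ** 0.5) + 1))
        while not is_prime(n):
            n += 1
        return n

    def maybe_rehash(keys, m):
        if len(keys) / m > 0.70:       # table more than 70% full
            m = next_prime(2 * m)      # double, then round up to a prime
            # ...reinsert every key with the new modulus m...
        return m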
Lempel-Ziv-Welch (LZW) Compression Algorithm

• Introduction to the LZW Algorithm
• Example 1: Encoding using LZW
• Example 2: Decoding using LZW
• LZW: Concluding Notes
Introduction to LZW

• As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place.
• Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly.
• LZW is the foremost technique for general-purpose data compression due to its simplicity and versatility.
• It is the basis of many PC utilities that claim to "double the capacity of your hard drive".
• LZW compression uses a code table, with 4096 as a common choice for the number of table entries.
Introduction to LZW (cont'd)

• Codes 0-255 in the code table are always assigned to represent single bytes from the input file.
• When encoding begins, the code table contains only the first 256 entries, with the remainder of the table being blank.
• Compression is achieved by using codes 256 through 4095 to represent sequences of bytes.
• As the encoding continues, LZW identifies repeated sequences in the data and adds them to the code table.
• Decoding is achieved by taking each code from the compressed file and translating it through the code table to find what character or characters it represents.
LZW Encoding Algorithm

Initialize table with single-character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output the code for P
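
A runnable Python version of the pseudocode above (a sketch: it assumes non-empty input, new codes start at 256, and the 4096-entry limit is ignored for brevity):

    def lzw_encode(data):
        table = {chr(i): i for i in range(256)}   # single-character strings
        result, p = [], data[0]
        for c in data[1:]:
            if p + c in table:
                p = p + c                         # extend the current match
            else:
                result.append(table[p])           # output the code for P
                table[p + c] = len(table)         # add P + C to the string table
                p = c
        result.append(table[p])                   # output the code for the final P
        return result

    lzw_encode("BABAABAAA")   # [66, 65, 256, 257, 65, 260]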
Example 1: Compression using LZW

Example 1: Use the LZW algorithm to compress the string BABAABAAA.
Example 1: LZW Compression Step 1

BABAABAAA        P = A    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
Example 1: LZW Compression Step 2

BABAABAAA        P = B    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
Example 1: LZW Compression Step 3

BABAABAAA        P = A    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
Example 1: LZW Compression Step 4

BABAABAAA        P = A    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
Example 1: LZW Compression Step 5

BABAABAAA        P = A    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA
Example 1: LZW Compression Step 6

BABAABAAA        P = AA    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA
260           AA
LZW Decompression

• The LZW decompressor creates the same string table during decompression.
• It starts with the first 256 table entries initialized to single characters.
• The string table is updated for each code in the input stream, except the first one.
• Decoding is achieved by reading codes and translating them through the code table being built.
LZW Decompression Algorithm

Initialize table with single-character strings
OLD = first input code
output translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add OLD + C to the string table
    OLD = NEW
END WHILE
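
A runnable Python version of the pseudocode above (a sketch matching the encoder shown earlier; it assumes a non-empty code sequence):

    def lzw_decode(codes):
        table = {i: chr(i) for i in range(256)}   # single-character strings
        old = codes[0]
        result = table[old]                       # output translation of OLD
        c = table[old][0]
        for new in codes[1:]:
            if new not in table:                  # the tricky not-yet-defined case
                s = table[old] + c
            else:
                s = table[new]
            result += s                           # output S
            c = s[0]                              # C = first character of S
            table[len(table)] = table[old] + c    # add OLD + C to the string table
            old = new
        return result

    lzw_decode([66, 65, 256, 257, 65, 260])   # 'BABAABAAA'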
Example 2: LZW Decompression

Example 2: Use LZW to decompress the output sequence of Example 1:
<66><65><256><257><65><260>.
Example 2: LZW Decompression Step 1

<66><65><256><257><65><260>
Old = 66    New = 65    S = A    C = A

DECODER OUTPUT    STRING TABLE
string            codeword   string
B
A                 256        BA
Example 2: LZW Decompression Step 2

<66><65><256><257><65><260>
Old = 65    New = 256    S = BA    C = B

DECODER OUTPUT    STRING TABLE
string            codeword   string
B
A                 256        BA
BA                257        AB
Example 2: LZW Decompression Step 3

<66><65><256><257><65><260>
Old = 256    New = 257    S = AB    C = A

DECODER OUTPUT    STRING TABLE
string            codeword   string
B
A                 256        BA
BA                257        AB
AB                258        BAA
Example 2: LZW Decompression Step 4

<66><65><256><257><65><260>
Old = 257    New = 65    S = A    C = A

DECODER OUTPUT    STRING TABLE
string            codeword   string
B
A                 256        BA
BA                257        AB
AB                258        BAA
A                 259        ABA
Example 2: LZW Decompression Step 5

<66><65><256><257><65><260>
Old = 65    New = 260    S = AA    C = A

DECODER OUTPUT    STRING TABLE
string            codeword   string
B
A                 256        BA
BA                257        AB
AB                258        BAA
A                 259        ABA
AA                260        AA
LZW: Some Notes

• This algorithm compresses repetitive sequences of data well.
• Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it.
• In this example, 72 bits are represented with 72 bits of data (9 characters at 8 bits in, 6 codewords at 12 bits out). After a reasonable string table is built, compression improves dramatically.
• Advantages of LZW over Huffman:
  • LZW requires no prior information about the input data stream.
  • LZW can compress the input stream in one single pass.
  • Another advantage of LZW is its simplicity, allowing fast execution.
LZW: Limitations

What happens when the dictionary gets too large (i.e., when all the 4096 locations have been used)? Here are some options usually implemented:
• Simply forget about adding any more entries and use the table as is.
• Throw the dictionary away when it reaches a certain size.
• Throw the dictionary away when it is no longer effective at compression.
• Clear entries 256-4095 and start building the dictionary again.
• Some clever schemes rebuild a string table from the last N input characters.