
Algorithms and data structures
7.11.2015.
Protected by http://creativecommons.org/licenses/by-nc-sa/3.0/hr/
Creative Commons

You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material

Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial — You may not use the material for commercial purposes.
- ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Notices:
- You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
- No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
Text copied from http://creativecommons.org/licenses/by-nc-sa/3.0/
Algorithms and data structures, FER
7.11.2015.
2 / 40
Addressing techniques
Basics
Retrieval procedures
Hashing
Basics


Knowing the key of some record, the question arises how to find that record.

Primary key
- Defines a record uniquely
  - E.g. StudentID
- Concatenated (composite) keys are necessary for unique identification of some types of records
  - E.g. StudentID & CourseCode & ExamDate uniquely define a record of an examination (a possible examination before a committee due to a student's complaint is neglected!)

Secondary key
- Need not define the record uniquely; it points at some attribute value
  - E.g. YearOfStudy in the record with course data
Sequential search

- Searching the file record by record is the most primitive way
- It is used for sequential files, where all the records have to be read anyhow
- Other terms: linear search, serial search
- The records need not be sorted

Repeat for all the records
  If the current record equals the searched one
    Record is found
    Leave the loop

- On average, n/2 records are read
- Complexity: O(n)
  - Best case: O(1)
  - Worst case: O(n)
- (Program: IspisiTrazi / PrintSearch)
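The loop above can be sketched in C. This is a minimal sketch under assumed names and types (an array of int keys, a function called sequential_search); it is not part of the original material:

```c
#include <stddef.h>

/* Sequential (linear) search over an unsorted array of integer keys.
   Returns the index of the first match, or -1 if the key is absent. */
int sequential_search(const int *records, size_t n, int key) {
    for (size_t i = 0; i < n; i++) {
        if (records[i] == key)
            return (int)i;   /* record is found - leave the loop */
    }
    return -1;               /* all n records read, key not present */
}
```

In the worst case all n records are compared, matching the O(n) bound above.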
Sequential searching of sorted records

How to improve the sequential search?
- Sort the records according to a key!

Sort the records
Repeat for all records
  If the current record equals the searched one
    Record is found
    Leave the loop
  If the current record is larger than the searched one
    Record does not exist
    Leave the loop

What are the complexities in the best, worst and average case when searching sorted records?
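The early-exit rule in the pseudocode can be sketched in C (same assumed int-key array as before; the function name is illustrative):

```c
#include <stddef.h>

/* Sequential search over an array sorted in ascending order.
   Stops as soon as a larger key is seen, because the searched key
   can no longer appear. Returns the index, or -1 if absent. */
int sorted_sequential_search(const int *records, size_t n, int key) {
    for (size_t i = 0; i < n; i++) {
        if (records[i] == key)
            return (int)i;   /* record is found */
        if (records[i] > key)
            break;           /* record cannot exist further on */
    }
    return -1;
}
```

An unsuccessful search now also stops after roughly n/2 comparisons on average, instead of always reading all n records.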
Questions and exercises
1. A colleague tells you that s/he wrote a sequential search algorithm with complexity O(log n). Should you congratulate him or laugh at him?
2. In the best case, the record will be found after the minimum number of comparisons. Where was this record located?
3. Where was the record found after the maximum number of comparisons?
Block-wise reading

- In direct (random) access files (all the records are of the same length!) sorted by the primary key, it is not necessary to check all the records
- E.g. only every hundredth record is examined
- When the block containing the record with the searched key is located, the block is searched sequentially

How to find the optimal block size?
Example – places in Croatia

We are looking for the city of Malinska in a list of F = 6935 places; each page contains B = 60 places. There are ⌈F / B⌉ = 116 leading records (and corresponding pages).

Page 1: Ada, Adamovec, Adžamovci, …, Bair, Bajagić, Bajčići
Page 2: Bajići, Bajkini, Bakar-dio, …, Barilović, Barkovići, Barlabaševec
…
Page 60: Mali Gradac, Mali Grđevac, Mali Iž, …, Malinska, …, Manja Vas, Manjadvorci, Manjerovići
Page 61: Maovice, Maovice, Maračići, …, Martin, Martina, Martinac
…
Page 115: Zvijerci, Zvjerinac, Zvoneća, …, Žitomir, Živaja, Živike
Page 116: Živković Kosa, Živogošće, Žlebec Gorički, …, Žutnica, Žužići, Žužići

(Program: CitanjePoBlokovima / ReadingBlockWise)
Optimal block size

- With F records and block size B, there are F / B leading records and blocks
- It is expected that during the search by blocks, half of the existing leading records of the blocks must be read
- On average, the searched leading record will be found after reading (F / B) / 2 = F / (2B) records
- Within the located block there are B records, so it can be expected that the searched record will be found after on average B / 2 (sequential!) readings within that block
- The total expected number of readings is F / (2B) + B / 2
- Setting the derivative with respect to B equal to zero yields the optimal block size: B = √F

What is the optimal block size for the list of places in Croatia?
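The optimisation step can be written out explicitly. Minimising the expected number of readings R(B):

\[
R(B) = \frac{F}{2B} + \frac{B}{2}, \qquad
\frac{dR}{dB} = -\frac{F}{2B^{2}} + \frac{1}{2} = 0
\;\Longrightarrow\; B^{2} = F \;\Longrightarrow\; B = \sqrt{F}
\]

For the list of Croatian places, F = 6935, so the optimal block size is B = √6935 ≈ 83.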
Binary search

- The binary search starts at the half of the file/array and continues with constant halving of the interval where the searched record could be found
- Prerequisite: the data must be sorted!
- Average number of search steps: log2 n
- The procedure is inappropriate for disk memories with direct access due to the time-consuming positioning of the reading heads; it is recommended for core (main) memory
- The fact that the array is sorted is exploited: in each step the search area is halved
- With n elements, the number of search steps is log2 n, so the complexity is O(log2 n)

[Figure: repeated halving of the search interval until the searched element is located]
Example of binary search

Looking for 25
2 5 6 8 9 12 15 21 23 25 31 39
Algorithm for binary search
lower_bound = 0
upper_bound = total_elements_count
Repeat
  Find the middle record
  If the middle record equals the searched one
    Record is found
    Leave the loop
  If the lower bound is greater than or equal to the upper bound
    Record is not found
    Leave the loop
  If the middle record is smaller than the searched one
    Set the lower bound to the position of the current record + 1
  If the middle record is larger than the searched one
    Set the upper bound to the position of the current record - 1

(Program: BinarnoPretrazivanje / BinarySearch)
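A C sketch of this pseudocode, using the same int-key array assumed in the earlier sketches (inclusive bounds are used here, a common variant of the scheme above):

```c
#include <stddef.h>

/* Binary search over an array sorted in ascending order.
   Returns the index of the key, or -1 if it is not present. */
int binary_search(const int *records, size_t n, int key) {
    long lower = 0;
    long upper = (long)n - 1;
    while (lower <= upper) {
        long middle = lower + (upper - lower) / 2;  /* middle record */
        if (records[middle] == key)
            return (int)middle;     /* record is found */
        if (records[middle] < key)
            lower = middle + 1;     /* continue in the upper half */
        else
            upper = middle - 1;     /* continue in the lower half */
    }
    return -1;                      /* bounds crossed: not found */
}
```

On the slide's example array, searching for 25 inspects positions 5, 8, 10 and 9 — four steps, within the log2 12 ≈ 3.6 bound rounded up.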
Questions
1. What are the execution times in binary search of n records for the best, worst and average case?
   (The average case is just 1 step cheaper than the worst case. The proof can be seen at: http://www.mcs.sdsmt.edu/ecorwin/cs251/binavg/binavg.htm, March 24th, 2014)
2. For the search of places in Croatia (list of 6935 places), what is the maximum number of steps necessary to locate the searched place?
3. Will the binary search always be faster than the sequential one, even for a large set of data?
Problem

Suppose that n unsorted data in a set can be sorted in time O(n log2 n). You have to perform n searches in this data set. What is better:
- to use the sequential search, or
- to sort the data and then apply the binary search?

Solution: it makes more sense to sort first and then search binary!
- n sequential searches: n * O(n) = O(n^2)
- sort + binary: O(n log2 n) + n * O(log2 n) = O(n log2 n)
- O(n log2 n) < O(n^2)
Index-sequential files

- Every record contains a key as a unique identifier
- If a file is sorted by the key, it is appropriate to form a look-up table
  - Input for the table is the key of the searched record
  - Output is the information regarding the more precise location of the searched record
- Such a table is called an index
  - The index need not point to each record but only to a block
  - In the example with places in Croatia, it was shown that the optimal block size is √F
- For large files there are indices on multiple levels – with the optimal organisation, in the worst case the same number of records is read in each of the indices and in the data file
  - Optimal sizes of indices on 2 levels are F^(1/3) and F^(2/3), giving complexity O(F^(1/3))
  - For k levels: F^(1/(k+1)), F^(2/(k+1)), F^(3/(k+1)), …, F^(k/(k+1)), giving complexity O(F^(1/(k+1)))
Index-sequential files

Insertion and deletion
- In a traditional sequential file (magnetic tape), they are possible only by copying the whole file with addition and/or deletion of records
- In direct (random) access files, deletion is performed logically, i.e. a tag is written to mark a deleted record
- After a certain number of revisions, the file has to be reorganised
  - Data are written sequentially by the key value and indexing is repeated (so-called file maintenance)
Search procedures

Index non-sequential files
- If search by multiple keys is required, or if addition and deletion of records are frequent (volatile files), it is very difficult (if not impossible) in the first case, and difficult in the latter, to maintain the requirement that the records are sorted and contained within their initial block
- In that case, the index should contain the address (relative or absolute) of each single record

The key contains the address
- The simplest case is to form the key so that some part of it contains the record address
  - E.g. at an entrance examination for enrolment, the application number can serve as the key and simultaneously be the ordinal number of the record in a direct access file
  - Very often such a simple procedure is not possible, because the coding scheme cannot be adapted to each single application
The idea of hashing

Problem: a company employs about a hundred thousand employees; every person has his or her own unique identifier ID, generated from the interval [0, 1 000 000]. Record reads and writes must be fast. How should the file be organised?
- A direct file with the key equal to the ID?
  - 1 000 000 × 4 bytes ≈ 4 MB, and 90% of the space is unused!
- It is possible to devise procedures that transform the key into an address, or, even better, into some ordinal number
  - The position of the record is stored under this ordinal number
  - This modification improves flexibility
Hashing

- Let us suppose that M buckets (blocks of records with the same starting address of the block) are available
- A pseudo-random number from the interval 0 to M-1 is calculated from the key value using a hash function
- This number is the address of a group of data (a bucket); all keys transformed into the same pseudo-random number share that bucket
- A collision happens when two different keys are transformed into the same address
- If a bucket is full, it is possible to insert a pointer to an overflow area, or insertion is attempted in the next neighbouring bucket ("bad neighbour" policy)
- In hashing, the following parameters can vary:
  - bucket capacity
  - packing density
Example

Store the names into a hash table
- hash-function = (sum of ASCII codes) % (number of buckets)

[Figure: a table with 4 buckets (0-3); the names Vanja, Matija, Andrea, Doris, Saša, Alex, Sandi, Perica and Iva are hashed into the buckets - which bucket does each name land in?]
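The hash function from the example can be sketched in C. This is a minimal sketch for plain ASCII names (a name such as "Saša" contains a multi-byte character in UTF-8, where each encoded byte would contribute to the sum); the function name is an assumption:

```c
#include <stddef.h>

/* hash = (sum of character codes) % (number of buckets).
   unsigned char is used so that the sum is well defined
   for byte values above 127. */
size_t name_hash(const char *name, size_t buckets) {
    size_t sum = 0;
    for (const unsigned char *p = (const unsigned char *)name; *p; p++)
        sum += *p;
    return sum % buckets;
}
```

For example, "Iva" sums to 73 + 118 + 97 = 288, which lands in bucket 288 % 4 = 0.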
Bucket capacity

- A pseudo-random number is generated through a transformation of the key, yielding the bucket address
- If the bucket capacity equals 1, overflow is frequent
- As the bucket size increases, the probability of overflow decreases, but reading a single bucket is more time-consuming and the amount of sequential search within a bucket increases
- It is recommended to match the bucket size with the physical size of the record block on external memory (disk block)
Packing density

- After the bucket size has been chosen, the packing density can be selected, i.e. the number of buckets to store the foreseen number of records
- To reduce the number of overflows, a larger total capacity is chosen
- Packing density = number of records / total capacity
  - N = number of records to be stored
  - M = number of buckets
  - C = number of records within a bucket
  - Packing density = N / (M * C)
How to deal with overflow?

Using the primary area
- If a bucket is full, use the next one, etc.
- After the last one comes the first one (cyclically)
- Efficient if the bucket size exceeds 10

Separate chaining
- Buckets are organised as linear lists
Statistics of hashing

- Let M be the number of buckets and N the amount of input data. The probability of directing x records into a certain bucket obeys the binomial distribution:

  \[ P_x = \frac{N!}{x!\,(N-x)!} \left(\frac{1}{M}\right)^{x} \left(1-\frac{1}{M}\right)^{N-x} \]

- The probability of Y overflows from a bucket of capacity C is P(C + Y), so the expected number of overflows from a given bucket is:

  \[ s = \sum_{Y=1}^{\infty} P(C+Y) \cdot Y \]

- The total expected number of overflows, expressed as a percentage of N, is 100 s M / N
- The average number of records entered into the hash table before the first collision is ~ 1.25 √M
- The average total number of entered records before every bucket contains at least 1 record is M ln M
Transformation of the key into address

Generally, the key is transformed into the address in 3 steps:
1. If the key is not numeric, it is transformed into a number, preferably without loss of information
2. An algorithm is applied to transform the key, as uniformly as possible, into a pseudo-random number with an order of magnitude of the bucket count
3. The result is multiplied by a constant ≤ 1 for transformation into an interval of relative addresses equal to the number of buckets
   - Relative addresses are converted into absolute ones on a concrete physical unit; as a rule, that is the task of system programs

An ideal transformation: the probability of transforming 2 different keys in a table of size M into the same address is 1/M
Characteristics of a good transformation

- The output value depends only on the input data
  - If it also depended on some other variable, that variable's value would have to be known at search time as well
- The function uses all the information from the input data
  - If it did not, a small variation of the input data would yield a large number of equal outputs - the distribution would depart from the desired one
- It distributes the output values uniformly
  - Otherwise efficiency is decreased
- For similar input data, it produces very different output values
  - In reality, the input data are often very similar, while a uniform distribution at the output is required

Which of these requirements are not fulfilled by our hash function from the previous example?
Usage of hashing

When is it appropriate?
- Compilers use it for recording declared variables
- For spelling checkers and dictionaries
- In games, to store the positions of players
- For equality checks (in information security)
  - If two elements result in different hash values, they must be different
- When quick and frequent search is required

When is it not appropriate?
- When records are searched by a non-key attribute value
- When the data need to be sorted
  - E.g. to find the minimum value of the key
Example: Transformation of key into address

- 6-digit key, 7000 buckets; key: 172148
- Method: middle digits of the key squared
  - The key squared yields a 12-digit number; digits 5 to 8 are used
  - 172148^2 = 029634933904 → middle digits 3493
  - The middle 4 digits should be transformed into the interval [0, 6999]
  - As the pseudo-random number takes values from the interval [0, 9999], while the bucket addresses can be from the interval [0, 6999], the multiplication factor is 6999/9999 ≈ 0.7
  - Bucket address = 3493 * 0.7 = 2445
  - The result behaves like a roulette wheel (pseudo-random)
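The mid-square method above can be sketched in C. The function name is an assumption; the factor 0.7 is the slide's rounded value of 6999/9999:

```c
/* Middle digits of the key squared: a 6-digit key is squared into a
   (zero-padded) 12-digit number, digits 5-8 are taken, and the result
   is scaled by 0.7 into the bucket-address interval [0, 6999]. */
int mid_square_address(long key) {
    unsigned long long sq = (unsigned long long)key * (unsigned long long)key;
    /* dividing by 10^4 drops digits 9-12; mod 10^4 keeps digits 5-8 */
    unsigned long long middle = (sq / 10000ULL) % 10000ULL;
    return (int)(middle * 0.7);
}
```

For the key 172148 the square is 029634933904, the middle digits are 3493, and the address is 3493 * 0.7 = 2445.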

Methods for transformation of key into address

Root from the middle digits of the key squared
- As in the previous example, but after squaring, the square root of the 8 middle digits is calculated and cut off to obtain a four-digit integer:
- sqrt(96349339) = 9815

Division
- The key is divided by a prime number slightly less than or equal to the number of buckets (e.g. 6997)
- The remainder after division is the bucket address
  - Bucket address = 172148 mod 6997 = 4220
- Keys forming a sequence of values are well distributed

Shifting of digits and addition
- E.g. key = 17207359
  - 1720 + 7359 = 9079
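The three methods above can be sketched in C. The function names are assumptions; an integer square root is used instead of libm's sqrt to keep the sketch self-contained:

```c
/* Integer square root by linear probing (fine for 4-5 digit results). */
static unsigned long long isqrt(unsigned long long x) {
    unsigned long long r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}

/* Root from the middle digits of the key squared: digits 3-10 of the
   12-digit square are taken and their square root is cut off. */
int mid_sqrt_address(long key) {
    unsigned long long sq = (unsigned long long)key * (unsigned long long)key;
    unsigned long long mid8 = (sq / 100ULL) % 100000000ULL; /* digits 3-10 */
    return (int)isqrt(mid8);
}

/* Division: remainder after dividing by a prime slightly below
   the number of buckets (e.g. 6997 for 7000 buckets). */
int division_address(long key, long prime) {
    return (int)(key % prime);
}

/* Shifting of digits and addition: the 8-digit key is split into
   two 4-digit halves, which are added together. */
int shift_fold(long key) {
    return (int)(key / 10000 + key % 10000);
}
```

These reproduce the slide's worked numbers: sqrt(96349339) cut off to 9815, 172148 mod 6997 = 4220, and 1720 + 7359 = 9079.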
Methods for transformation of key into address

Overlapping (folding)
- Overlapping resembles shifting, but is more appropriate for long keys
- E.g. key = 172407359
  - 407 + 953 + 271 = 1631 (the outer groups 172 and 359 are reversed before addition)

Change of the counting base
- The number is evaluated as if its digits were written in another counting base B
- E.g. B = 11, key = 172148
  - 1*11^5 + 7*11^4 + 2*11^3 + 1*11^2 + 4*11^1 + 8*11^0 = 266373
- The necessary number of least significant digits is selected and transformed into the address interval: bucket address = 6373 * 0.7 = 4461

The best method can be chosen after simulation for a concrete application
- Generally, division gives the best results
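The change-of-base method can be sketched in C (the function name is an assumption):

```c
/* Reinterpret the decimal digits of the key as digits in base `base`.
   E.g. 172148 in base 11: 1*11^5 + 7*11^4 + 2*11^3 + 1*11^2 + 4*11 + 8. */
long rebase(long key, int base) {
    long result = 0;
    long power = 1;
    while (key > 0) {
        result += (key % 10) * power;  /* next decimal digit, from the right */
        key /= 10;
        power *= base;
    }
    return result;
}
```

For key 172148 and B = 11 this yields 266373; keeping the 4 least significant digits and scaling by 0.7 gives the bucket address 6373 * 0.7 = 4461, as on the slide.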


Determination of parameters

Example:
- There are 350 students enrolled in a study. Their ID - identification number (11 characters) - and the family name (14 characters) should be stored, with the requirement to retrieve the records fast using the ID.
- Remark: the ID contains 11 digits
  - The last digit is a control digit; it can, but need not, be stored if the rule for calculating it from the rest of the digits is known
Solution

- A record contains 11+1 + 14+1 = 27 bytes
- Let the physical block on the disk be 512 bytes
  - The bucket size should be equal to or less than that
  - 512 / 27 = 18.963
  - Therefore, a bucket shall contain data for 18 students, with 26 bytes of unused space
- A somewhat larger table capacity, e.g. 30%, shall be provided to reduce the number of expected overflows
  - That means 350 / 18 * 1.3 ≈ 25 buckets
  - The ID should be transformed into a bucket address from the interval [0, 24]
- The ID is rather long, so overlapping (folding) can be considered
  - Methods can be combined – after overlapping, division can be performed
  - The address shall be calculated by division with a prime number close to the bucket count, e.g. 23
Writing of records into buckets of the hash table

[Figure: a record (ID, C, FamilyName) is passed through the HASH function to obtain a bucket address in the interval 0 … M-1; each bucket corresponds to a BLOCK on the disk]
Examples of key transformation

Overlapping (digits 1-3 reversed + digits 4-7 + digits 8-10 reversed), followed by division modulo 23:

- ID = 5702926036x: 075 + 2926 + 630 = 3631; 3631 mod 23 = 20
- ID = 6702926036x: 076 + 2926 + 630 = 3632; 3632 mod 23 = 21
- ID = 6702926037x: 076 + 2926 + 730 = 3732; 3732 mod 23 = 6
- ID = 5702926037x: 075 + 2926 + 730 = 3731; 3731 mod 23 = 5

If the bucket is full, entering into the next bucket is attempted cyclically (bad neighbour policy)
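The transformation in these examples can be sketched in C (function names are assumptions; the ID is taken as a string of at least 10 digit characters, the control digit being ignored):

```c
/* Reverse a group of 3 digit characters into an integer:
   "570" -> 075 = 75, "036" -> 630. */
static int rev3(const char *d) {
    return (d[2] - '0') * 100 + (d[1] - '0') * 10 + (d[0] - '0');
}

/* Overlapping of the first 10 digits of the ID (outer 3-digit groups
   reversed, middle 4 digits kept), then division modulo 23. */
int id_to_bucket(const char *id) {
    int mid4 = (id[3] - '0') * 1000 + (id[4] - '0') * 100
             + (id[5] - '0') * 10   + (id[6] - '0');
    int sum = rev3(id) + mid4 + rev3(id + 7);
    return sum % 23;
}
```

This reproduces the four worked rows above, e.g. "5702926036" gives 75 + 2926 + 630 = 3631 and bucket 20.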
Algorithm

Create an empty table on a disk
Read ID and family name sequentially, while there are data
  If the control digit is not correct
    "Incorrect ID"
  Else
    Set the tag that the record is not written
    Calculate the bucket address
    Remember it as the initial address
    Repeat
      Read the existing records from the bucket
      Repeat for all the records from the bucket
        If the record is not empty
          If the already written ID equals the input one
            "Record already exists"
            Set the tag that the record is written
            Leave the loop
        Else
          Write the input record
          Set the tag that the record is written
          Leave the loop
      If the record is not written
        Increment the bucket address by 1, modulo the bucket count
        If the resulting address equals the initial one
          Table is full
    Until the record is written or the table is full
End

(Program: Hash)
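The core of this algorithm, for a memory-resident table with single-record buckets, can be sketched in C. The names Record and hash_insert, the table size, and the convention that ID 0 marks an empty bucket are illustrative assumptions (the zero-ID convention mirrors the one used in the exercises):

```c
#include <string.h>

#define BUCKETS 8                 /* small table, for illustration only */

typedef struct {
    int  id;                      /* 0 marks an empty bucket */
    char name[21];
} Record;

/* Insert (id, name) at the precomputed bucket address; on overflow, try
   the following buckets cyclically ("bad neighbour" policy).
   Returns 1 on success, 0 if the id already exists, -1 if the table is full. */
int hash_insert(Record table[BUCKETS], int address, int id, const char *name) {
    int start = address;          /* remember the initial address */
    do {
        if (table[address].id == id)
            return 0;                         /* record already exists */
        if (table[address].id == 0) {         /* empty bucket: write here */
            table[address].id = id;
            strncpy(table[address].name, name, 20);
            table[address].name[20] = '\0';
            return 1;
        }
        address = (address + 1) % BUCKETS;    /* next bucket, cyclically */
    } while (address != start);
    return -1;                    /* came back to the initial address: full */
}
```

The do-while over the cyclically incremented address corresponds to the outer Repeat … Until loop of the pseudocode; returning -1 once the initial address is reached again corresponds to "Table is full".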
Exercises

- Update Hash so that instead of JMBG (13-character ID) it uses OIB (11-character ID). Update the function "Kontrola" for checking the control digit. Implement deletion by ID.
- Let products be the name of an unformatted file organised using hashing. Each record consists of ID (int), name (50+1 char), quantity (int) and price (float). A record is considered empty if its ID equals zero. Count the number of non-empty records in the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC. These parameters are contained in parameters.h.
- Let an unformatted file be organised using hashing. Each record of the file consists of name (50+1 char), quantity (int) and price (float). A record is considered empty if its quantity equals zero. The size of the block on the disk is BLOCK, which is contained in parameters.h. Write the function that will find the packing density. The prototype of the function is:
  float density (const char *file_name);
Exercises

- Write the function for the insertion of an ID (int) and name (char 20+1) into a memory-resident hash table with 500 buckets. Each bucket consists of a single record. If the bucket is full, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed), the ID and the name. The function returns 1 if the insertion is successful, 0 if the record already exists, and -1 if the table is full so the record cannot be inserted.

- Write the function that will find an ID (int) and the company name (30+1) in a memory-resident hash table with 200 buckets. Each bucket consists of a single record. If a bucket is full and does not contain the searched key value, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed) and the ID. The output argument is the company name. The function returns 1 if the record is found, and 0 if it is not found.
Exercises

- Let a key be a 7-digit telephone number. Write the function for transformation of the key into an address. The hash table consists of M buckets. The division method should be used. The prototype of the function is:
  int address (int m, long teleph);

- Let products be the name of an unformatted file organised using hashing. A record consists of ID (4-digit integer), name (up to 30 char) and price (float). A record is considered empty if its ID equals zero. Write the function for emptying the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC. These parameters are contained in parameters.h.

- Write the function that will compute the bucket address in a table of 500 buckets. The key is a 4-digit ID, and the method is the root of the middle digits of the key squared.