Chapter 5 Record Storage & Primary File Organizations


Storage • There are two general types of storage media used with computers: – Primary Storage - All storage media that can be operated on directly by the CPU (RAM, L1 and L2 cache memory). – Secondary Storage - Hard drives, CDs, and tape.


Memory Hierarchies & Storage Devices • The memory hierarchy is organized by speed of access. However, that speed comes with a price tag, which varies inversely with access time: as with cars, the faster the memory access, the more it costs.


Primary Storage Level of Memory • The primary storage level of memory is generally made up of 3 levels:

– L1 cache, located on the CPU – L2 cache, located near the CPU – Main memory, the RAM figure often quoted in computer advertisements

Secondary Storage Level of Memory • The secondary storage level of memory may be made up of 4 levels:

– Flash memory (EEPROM) – Hard drives – CD-ROMs – Tape

Figure 5.1


Terms Used in the Hardware Description of Hard Drives • Capacity - The number of bytes it can store.

• Single-sided vs. Double-sided - Whether the disk/platter is written on one side or on both.

• Disk Pack - A collection of disks/platters that are assembled together into a pack.

• Track - A thin circular band on a disk surface. A disk surface will have many tracks.


Terms Used in the Hardware Description of Hard Drives • Sector - A segment or arc of a track.

• Block - One of the equal-sized portions into which the operating system divides a track.

• Interblock Gaps - Fixed-size segments that separate the blocks.

• Read/Write Head - Actually reads/writes the information to the disk.


Terms Used in the Hardware Description of Hard Drives • Cylinder - The set of tracks with the same diameter across all the disk surfaces of a disk pack.


Figure 5.2


Terms Used in Measuring Disk Operations • Seek Time (s) - The time it takes to position the read/write head over the desired track. It will be given in any problem that requires it.

• Rotational Delay (rd) - The average time it takes the desired block to rotate into position under the read/write head: rd = (1/2)*(1/p) min, where p is the rotational speed of the disk in rpm.

Terms Used in Measuring Disk Operations • Transfer Rate (tr) - The rate at which information can be transferred to or from the disk: tr = (track size)/(1/p min), i.e., one full track per revolution. • Block Transfer Time (btt) - The time it takes to transfer the data once the read/write head has been positioned: btt = B/tr msec, where B is the block size in bytes.


Terms Used in Measuring Disk Operations • Bulk Transfer Rate (btr) - The rate at which multiple blocks can be read/written to contiguous blocks: btr = (B/(B+G)) * tr bytes/msec, where G is the interblock gap size in bytes. • Rewrite Time (Trw) - The time it takes, after a block is read, to write that same block back to the disk: the time for one revolution.


Computing Times • Given: – Seek time (s) = 10 msec – Rotational speed = 3600 rpm – Track size = 50 KB – Block size (B) = 512 bytes – Interblock gap (G) = 128 bytes

Problems for Disk Operations • Compute the average time it takes to transfer 1 block on this system.

• Compute the average time it takes to transfer 20 non-contiguous blocks that are located on the same track.

• Compute the average time it takes to transfer 20 contiguous blocks.

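Below is a minimal Python sketch showing how the formulas from the preceding slides combine to answer these three problems. It assumes 1 KB = 1000 bytes and that contiguous transfers stream the gaps at the bulk rate; adjust if your course uses other conventions.

```python
s = 10.0                     # seek time (msec), given
rpm = 3600                   # rotational speed, given
rev = 60_000.0 / rpm         # one revolution = 16.667 msec
rd = rev / 2                 # average rotational delay = 8.333 msec
track = 50_000.0             # track size: 50 KB, taken as 50,000 bytes
B, G = 512.0, 128.0          # block size and interblock gap (bytes)

tr = track / rev             # transfer rate = 3000 bytes/msec
btt = B / tr                 # block transfer time ~ 0.171 msec
btr = (B / (B + G)) * tr     # bulk transfer rate = 2400 bytes/msec

# 1. One block: one seek, one rotational delay, one block transfer.
t_one = s + rd + btt                   # ~18.50 msec

# 2. Twenty non-contiguous blocks on the same track: seek once, but pay
#    an average rotational delay before every block.
t_scattered = s + 20 * (rd + btt)      # ~180.08 msec

# 3. Twenty contiguous blocks: seek and rotate once, then the blocks
#    (and the gaps between them) stream past at the bulk rate.
t_contig = s + rd + 20 * B / btr       # ~22.60 msec

print(f"{t_one:.2f}  {t_scattered:.2f}  {t_contig:.2f} msec")
```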

Parallelizing Disk Access Using RAID • RAID - Stands for Redundant Arrays of Inexpensive Disks or Redundant Arrays of Independent Disks.

• RAIDs are used to provide increased reliability, increased performance, or both.


RAID Levels • Level 0 - has no redundancy and the best write performance but its read performance is not as good as level 1.

• Level 1 - uses mirrored disks which provide redundancy and improved read performance.

• Level 2 - provides redundancy using Hamming codes.

RAID Levels • Level 3 - uses a single parity disk.

• Levels 4 and 5 - use block-level data striping, with level 5 distributing the parity across all the disks.

• Level 6 - uses the P + Q redundancy scheme, making use of Reed-Solomon codes to protect against the failure of 2 disks.

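To make the parity idea behind levels 3 through 5 concrete, here is a toy Python sketch of XOR parity. The short byte strings stand in for whole disks, which is an illustrative simplification:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data = [b"disk", b"one!", b"two."]   # three data "disks" (made-up contents)
parity = xor_blocks(data)            # the parity "disk"

# If any single data disk is lost, XOR-ing the survivors with the
# parity block reconstructs it.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
assert xor_blocks(survivors + [parity]) == data[lost]
```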

Figure 5.4


Figure 5.5


Figure 5.6


Records • A record is a group of related values or items. Each value or item is stored in a field of a specific data type.

• Records may be of either fixed or variable length.


Variable-Length Records in Files • There are several reasons records of the same record type may vary in length:

– Variable-length fields – Repeating fields • For efficiency reasons, different record types may be clustered in a file.


Figure 5.7


Spanned vs. Unspanned Records • When the records of a file are stored on disk, they are placed in blocks of a fixed size, and the block size will rarely match the record size. When the record size is smaller than the block size and the block size is not a multiple of the record size, a decision must be made: store each record entirely within one block and leave unused space (unspanned), or allow a record to be split across two blocks (spanned).

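A short sketch of the arithmetic behind this decision, using the standard blocking factor bfr = floor(B/R) for unspanned records; the record size R = 75 bytes and file size are made-up values:

```python
import math

B, R = 512, 75        # block size and (hypothetical) record size, in bytes
r = 10_000            # number of records in the file

# Unspanned: whole records only, so each block wastes the leftover bytes.
bfr = B // R                           # blocking factor: 6 records per block
wasted_per_block = B - bfr * R         # 62 bytes unused in every block
blocks_unspanned = math.ceil(r / bfr)  # 1667 blocks

# Spanned: records may cross block boundaries, so no space is wasted.
blocks_spanned = math.ceil(r * R / B)  # 1465 blocks
```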

Figure 5.8


File Operations • Files may be stored either in contiguous blocks or by linking the blocks together. There are advantages and disadvantages to both methods.

• Operations on files can be grouped into two types: retrieval and update. A retrieval involves only reading, while an update involves reading, modification, and writing.


File Structure • Heap (Pile) Files • Hash (Direct) Files • Ordered (Sorted) Files • B-Trees

• Once the data has been brought into memory, it can be accessed by an instruction in 0.00000004 seconds on a machine running at 25 MIPS. The disparity between memory access time and disk access time is enormous: we can perform 625,000 instructions in the time it takes to read/write one disk page. • To put this in human terms: suppose you are typing a letter for your boss and find a word you cannot make out, so you leave him a voice mail message. Since you were told to do nothing else, you patiently wait for his reply doing NOTHING! Unfortunately, he has just left on vacation and does not get your message for 3 WEEKS.

• This is similar to the computer waiting 0.025 seconds for a single disk access.

Heap (Pile) Files (Unordered) • Insertions - Very efficient • Search - Very inefficient (linear search) • Deletion - Very inefficient – Lazy deletion • Problems? • When are they used?


Ordered (Sorted) Files • Records are stored based on the value contained in one of their fields, called the ordering field.

• If the ordering field is also a key field, then the field is better described as an ordering key.

Advantages of Ordered Files • Reading the records in order of the ordering field is extremely efficient.

• Finding the next record is fast.

• Finding records based on a query on the ordering field is efficient (binary search).

• Binary search may be done on the blocks as well.
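As a sketch of that last point: binary search over the blocks costs about log2(b) block accesses for a file of b blocks, versus an average of b/2 for a linear scan. Here read_block is a hypothetical stand-in for a single disk access:

```python
def find_block(read_block, num_blocks, key):
    """Locate the block of an ordered file that could hold `key`.

    read_block(i) models one disk access: it returns the sorted list of
    ordering-field values stored in block i.
    """
    lo, hi = 0, num_blocks - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        values = read_block(mid)       # one block access
        if key < values[0]:
            hi = mid - 1               # key must lie in an earlier block
        elif key > values[-1]:
            lo = mid + 1               # key must lie in a later block
        else:
            return mid                 # key, if present, is in this block
    return None                        # key falls between blocks: not present
```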

Disadvantages of Ordered Files • Searches on non-ordering fields are inefficient.

• Insertion and deletion of records are very expensive.

• Solutions to these problems?


Hashing Techniques • A record's placement is determined by the value in its hash field. A hash (or randomizing) function is applied to that value, yielding the address of the disk block where the record is stored. For most records, we need only a single block access to retrieve that record.

Internal Hashing • Internal hashing is implemented as a hash table through the use of an array of records (in memory).

• The array index ranges from 0 to M-1. A function that transforms the hash field value into an integer between 0 and M-1 is used; a common one is h(K) = K mod M.
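A minimal in-memory sketch; the table size M and the sample record are illustrative:

```python
M = 101                          # array size; a prime (see the later slide)
table = [None] * M               # slots 0 .. M-1, initially empty

def h(K):
    return K % M                 # the slide's hash function h(K) = K mod M

record = (12345, "Smith")        # (hash field value, rest of the record)
table[h(record[0])] = record     # 12345 mod 101 = 23, so slot 23
```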

Internal Hashing (con’t) • Collisions occur when the hash field value of a record being inserted hashes to an address that already contains a different record.

• The process of finding another position for this record is called collision resolution.

Collision Resolution • Open Addressing - Places the record to be inserted in the first available position subsequent to the hash address.

• Chaining - A pointer field is added to each record location. When an overflow occurs, this pointer is set to point to overflow blocks, forming a linked list.
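A sketch of open addressing in its simplest form (linear probing: scan forward from the hash address, wrapping around, until the first available slot). Table size and keys are illustrative:

```python
M = 11
table = [None] * M

def insert(key, record):
    i = key % M                        # the home (hash) address
    for _ in range(M):
        if table[i] is None:
            table[i] = (key, record)   # first available position
            return i
        i = (i + 1) % M                # probe the next slot, wrapping around
    raise RuntimeError("table is full")

insert(22, "A")   # 22 mod 11 = 0 -> slot 0
insert(33, "B")   # 33 mod 11 = 0 -> collision, resolved to slot 1
```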

Collision Resolution (con’t) • Multiple Hashing - If an overflow occurs, a second hash function is used to find a new location. If that location is also filled, either another hash function is applied or open addressing is used.

Figure 5.10 (page 140)

Goals of the Hash Function • The goals of a good hash function are to uniformly distribute the records over the address space while minimizing collisions, so as to avoid wasting space.

• Research has shown: – A fill ratio of 70% to 90% is best. – When using a mod function, M should be a prime number.

External Hashing for Disk Files • External hashing makes use of buckets, each of which can hold multiple records.

• A bucket is either a block or a cluster of contiguous blocks.

• The hash function maps a key into a relative bucket number, rather than an absolute block address for the bucket.

Types of External Hashing • Using a fixed address space is called static hashing.

• Dynamically changing address space: – Extendible hashing (with a directory) – Linear hashing (without a directory)

Static Hashing • Under static hashing, a fixed number of buckets (M) is allocated.

• Based on the hash value, a bucket number is looked up in the block directory array, which yields the block address.

• If n records fit into each block, this method allows up to n*M records to be stored.
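A tiny sketch of static hashing: M fixed buckets, each standing in for a block that holds up to n records. All sizes are illustrative, and overflow handling is left out:

```python
M, n = 8, 3                          # fixed bucket count, records per block
buckets = [[] for _ in range(M)]     # bucket number -> block contents

def insert(key):
    block = buckets[key % M]         # hash value -> relative bucket number
    if len(block) >= n:
        raise OverflowError("bucket full: static hashing needs overflow chains")
    block.append(key)

# At most n*M = 24 records fit before some bucket must overflow.
```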

Figure 5.11 (page 143)

Figure 5.12 (page 144)

Extendible Hashing • In extendible hashing, a type of directory is maintained as an array of 2^d bucket addresses, where d is the number of high-order (leftmost) bits examined and is referred to as the global depth of the directory. However, there does NOT have to be a DISTINCT bucket for each directory entry.

• A local depth d’ is stored with each bucket to indicate the number of bits actually used for that bucket.

Figure 5.13 (page 146)

Overflow (Bucket Splitting) • When an overflow occurs in a bucket, that bucket is split. This is done by dynamically allocating a new bucket and redistributing the contents of the old bucket between the old and new buckets based on the increased local depth d’+1 of both these buckets.

Overflow (Bucket Splitting) • Now the new bucket’s address must be added to the directory.

• If the overflow occurred in a bucket whose current local depth d’ is less than or equal to the global depth d, adjust the directory entries accordingly. (No change in the directory size is made.)

Overflow (Bucket Splitting) • If the overflow occurred in a bucket whose current local depth d’ is now greater than the global depth d, you must increase the global depth accordingly.

• This results in a doubling of the directory size each time d is increased by 1, with appropriate adjustment of the entries.

Slide showing how buckets are split under Extendible Hashing.

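In the same spirit as that slide, here is a compact Python sketch of extendible hashing insertion with bucket splitting and directory doubling. It indexes the directory with the low-order (rightmost) bits rather than the high-order bits described above, which simplifies the code without changing the mechanics; the bucket capacity and class names are invented for illustration:

```python
CAP = 2                                    # records per bucket (illustrative)

class Bucket:
    def __init__(self, depth):
        self.d_local = depth               # local depth d'
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.d = 1                         # global depth d
        self.dir = [Bucket(1), Bucket(1)]  # 2**d directory entries

    def insert(self, key):
        b = self.dir[key & ((1 << self.d) - 1)]   # low d bits pick an entry
        if len(b.keys) < CAP:
            b.keys.append(key)
            return
        # Overflow: split b at local depth d'+1.
        if b.d_local == self.d:            # no spare bit: double the directory
            self.dir = self.dir + self.dir
            self.d += 1
        b.d_local += 1
        new_b = Bucket(b.d_local)
        pending, b.keys = b.keys + [key], []
        # Directory entries whose new distinguishing bit is 1 now point
        # at the new bucket; the rest keep pointing at the old one.
        for i in range(len(self.dir)):
            if self.dir[i] is b and (i >> (b.d_local - 1)) & 1:
                self.dir[i] = new_b
        for k in pending:                  # redistribute; may split again
            self.insert(k)

eh = ExtendibleHash()
for k in [5, 9, 13, 2, 6]:                 # the odd keys collide on the low bit
    eh.insert(k)
```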

Shrinking Extendible Hashing Files • The generally used principle for shrinking extendible hashing files is to halve the directory when d > d’ holds for all buckets after a deletion occurs.

• Buckets may be combined when each of the buckets to be combined is less than half full and they have the same bit pattern except for the d’th bit, e.g., d’ = 3 and the bit patterns 110 and 111.

Linear Hashing • Linear hashing allows the hash file to expand and shrink its number of buckets dynamically without needing a directory.

• It starts with M buckets numbered 0 to M-1 and uses the mod hash function h(K) = K mod M as the initial hash function, called h_i.

Linear Hashing (Con’t) • Overflow is handled by chaining: an individual overflow chain for each bucket.

• It works by methodically splitting the original buckets, starting with bucket 0: the contents of bucket 0 are redistributed between bucket 0 and bucket M (the new bucket) using a secondary hash function h_(i+1)(K) = K mod 2M.

Linear Hashing (Con’t) • This splitting of buckets is done in order (0, 1, …, M-1) REGARDLESS of which bucket the collision occurred in. To keep track of the next bucket to be split we use n; after the first split, n is incremented to 1.

• When a record hashes to a bucket number less than n, we use the secondary hash function to determine which of the two buckets it belongs in.

Linear Hashing (Con’t) • When all of the original M buckets have been split, we have 2M buckets and n = M.

• We then reset M to 2M and n to 0, and the secondary hash function becomes the primary hash function.

• Shrinking of the file is done based on the load factor, using the reverse of splitting.

Slide showing how to split using linear hashing.

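Along the lines of that slide, here is a Python sketch of linear hashing insertion. The split trigger used here (split the next bucket in order whenever any bucket overflows) is one common policy, and overflow chains are modeled simply by letting a bucket's list grow past capacity; names and sizes are illustrative:

```python
CAP = 2                                  # records per primary bucket

class LinearHash:
    def __init__(self, M=4):
        self.M = M                       # buckets at the start of this round
        self.n = 0                       # next bucket to split
        self.buckets = [[] for _ in range(M)]

    def _addr(self, key):
        i = key % self.M                 # h_i(K) = K mod M
        if i < self.n:                   # bucket i was already split this round,
            i = key % (2 * self.M)       # so use h_(i+1)(K) = K mod 2M
        return i

    def insert(self, key):
        b = self.buckets[self._addr(key)]
        b.append(key)
        if len(b) > CAP:                 # overflow: split bucket n (not b!)
            self.buckets.append([])      # the new bucket, number M + n
            old, self.buckets[self.n] = self.buckets[self.n], []
            self.n += 1
            for k in old:                # redistribute with K mod 2M
                self.buckets[k % (2 * self.M)].append(k)
            if self.n == self.M:         # round complete: all M buckets split
                self.M *= 2              # h_(i+1) becomes the primary function
                self.n = 0

lh = LinearHash(M=4)
for k in range(1, 12):
    lh.insert(k)    # splits proceed in order 0, 1, 2, ... as overflows occur
```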