Organizing Files for Performance

Chapter 6 Jim Skon File Processing - Organizing file for Performance MVNC 1

Organizing Files for Performance

• Data Compression
• Reclaiming space in files
• Fast Searching
• Keysorting

Data Compression

• Making files smaller
  » Use less storage, save space
  » Faster transmission
  » Processed faster
• Data compression
  » Encoding information more efficiently
  » Many techniques exist

Data Compression

• Consider fields with a fixed length or a fixed set of values
• A binary representation can save space
  » States: 50 states fit in 6 bits (one byte)
  » Zip: 0 to 99999 fits in 17 bits (three bytes)
• Called compact notation
  » Redundancy reduction
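As a rough illustration of compact notation, the sketch below packs a state and a ZIP code into four bytes. The shortened `STATES` table and the `pack`/`unpack` helpers are hypothetical names chosen for this example, not something the slides define.

```python
# Hypothetical, shortened state table; a real one would list all 50 codes.
STATES = ["AL", "AK", "AZ", "OH", "TX"]

def pack(state: str, zip_code: int) -> bytes:
    """Pack a state (1 byte) and a ZIP code (3 bytes) into 4 bytes."""
    index = STATES.index(state)        # an index 0..49 fits in 6 bits
    if not 0 <= zip_code <= 99999:     # 99999 fits in 17 bits
        raise ValueError("zip code out of range")
    return bytes([index]) + zip_code.to_bytes(3, "big")

def unpack(packed: bytes) -> tuple[str, int]:
    """Reverse pack(): decoding needs the same table as encoding."""
    return STATES[packed[0]], int.from_bytes(packed[1:4], "big")

packed = pack("OH", 43050)
print(len(packed), unpack(packed))     # 4 ('OH', 43050)
```

The text form "OH43050" takes seven bytes, so the packed form saves three bytes per record, at the price of a file that can no longer be read as text.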

Data Compression

• Cost of binary representations
  » File is not readable as text
  » Processing time for conversion
  » All software must include appropriate, compatible encoding and decoding routines
  » Potential loss of flexibility

Data Compression

• Suppressing repeating sequences
  » Consider a picture
    – Series of pixels, each a color
    – Colors represented by an 8-bit value
    – Values usually come in bunches, e.g.
    – 24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66
  » Run-length encoding
    – Represent long runs with a prefix (FF) followed by a count, followed by the color
    – 24 23 FF 05 22 FF 06 25 65 65 FF 04 66
  » Simple images would be small, busy images would be no bigger.
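The run-length example can be reproduced with a short sketch. Two details are assumptions that the slide does not state: runs shorter than `MIN_RUN` (3) are copied unchanged, and a literal FF pixel is always written as a run so that the decoder never mistakes it for a prefix.

```python
MARKER = 0xFF   # run prefix used on the slide
MIN_RUN = 3     # assumption: shorter runs are copied unchanged

def rle_encode(pixels: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(pixels):
        run = 1
        while i + run < len(pixels) and pixels[i + run] == pixels[i] and run < 255:
            run += 1
        # Assumption: a literal FF is always written as a run to stay unambiguous.
        if run >= MIN_RUN or pixels[i] == MARKER:
            out += bytes([MARKER, run, pixels[i]])
        else:
            out += pixels[i:i + run]
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == MARKER:               # prefix, count, color
            out += bytes([data[i + 2]]) * data[i + 1]
            i += 3
        else:                               # plain pixel
            out.append(data[i])
            i += 1
    return bytes(out)

pixels = bytes.fromhex("24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66")
encoded = rle_encode(pixels)
print(encoded.hex(" ").upper())   # 24 23 FF 05 22 FF 06 25 65 65 FF 04 66
```

Here 19 bytes shrink to 13. The two-pixel run of 65 is left alone because encoding it would take three bytes instead of two.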

Data Compression

• Assigning variable-length codes
  » Some values are more likely than others
  » Use shorter codes for frequently used values, longer ones for less used values
  » Each code must have the unique-prefix property
    – No code is the prefix of any other code
    – Thus we always know when we are at the end of a given code

Variable length codes

• Example:

  Letter:  a    b    c    d     e     f     g
  Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
  Code:    1    010  011  0000  0001  0010  0011

• Can be decoded with a binary tree!
• Called Huffman code
  » Algorithm exists to easily create optimal code
  » Requires that a table of codes be maintained with the file
  » Most often used for fixed codes
  » Example - Type 3 FAX
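The code table from the example can be exercised with a small sketch. The function names are illustrative, and the dictionary lookup on the accumulated bits stands in for walking the binary tree: because no code is a prefix of another, the first match is always a complete code.

```python
# Code table copied from the slide.
CODES = {"a": "1", "b": "010", "c": "011", "d": "0000",
         "e": "0001", "f": "0010", "g": "0011"}
PROBS = {"a": 0.4, "b": 0.1, "c": 0.1, "d": 0.1, "e": 0.1, "f": 0.1, "g": 0.1}

def encode(text: str) -> str:
    return "".join(CODES[ch] for ch in text)

def decode(bits: str) -> str:
    lookup = {code: letter for letter, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:      # unique-prefix property: first match wins
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a code")
    return "".join(out)

bits = encode("abacad")
print(bits, decode(bits))

# Expected code length in bits per letter, weighted by probability.
average = sum(PROBS[ch] * len(CODES[ch]) for ch in CODES)
print(round(average, 2))
```

With these probabilities the average works out to 2.6 bits per letter, compared with 3 bits for a fixed-length code over seven letters.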

Data Compression

• Irreversible compression
  » Compression which loses some information
  » Example: compress a 400x400 image into a 100x100 image by averaging groups of 16 adjacent pixels
  » Saves space, but the resolution of the picture is reduced
  » Used most often for visual or audio information (which has inherent redundancy)
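A minimal sketch of the averaging example follows, assuming a grayscale image held as a list of rows and dimensions that divide evenly by the block size. The `downsample` name and the synthetic test image are made up for illustration.

```python
def downsample(image, block=4):
    """Average each block x block tile into one pixel (lossy)."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows, block):
        out_row = []
        for c in range(0, cols, block):
            tile = [image[r + dr][c + dc]
                    for dr in range(block) for dc in range(block)]
            out_row.append(sum(tile) // len(tile))   # 16 pixels become 1
        out.append(out_row)
    return out

# Synthetic 400x400 image with 8-bit values.
image = [[(r + c) % 256 for c in range(400)] for r in range(400)]
small = downsample(image)
print(len(small), len(small[0]))   # 100 100
```

The result holds one sixteenth of the original data, and the original pixels cannot be recovered from the averages.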

Data Compression

• Compression in UNIX
  » pack and unpack programs
    – Use Huffman coding
    – 25% to 40% savings on text files
    – Much less on binary files
    – Use the ".z" file suffix
  » compress and uncompress programs
    – Use Lempel-Ziv compression
    – No coding table needed: self-coding
    – Use the ".Z" file suffix

Reclaiming space in files

• Suppose a variable-length record in the middle of a file is modified so that it is:
  » Longer?
  » Shorter?
• Suppose a record is:
  » Added to the middle?
  » Deleted from the middle?

Reclaiming space in files

• Record deletion and storage compaction
• Storage compaction
  » Recovering unused space in a file
  » Space freed by deletion or by a change in record size
• Consider deleted records
  » Must be able to recognize deleted records
  » Have a special mark for a deleted record
    – e.g., an asterisk as the first character of the key field
    – May be undeleted if not overwritten!

Dealing with Deleted records

• Occasional compaction
• Dynamic maintenance

Occasional compaction

• A process that is run periodically reads the file and rewrites it with no empty space.
• Could happen automatically every night/week/month
• File is unavailable while the operation is underway.
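One way to picture occasional compaction is a copy pass that skips marked records. The sketch assumes fixed-length records with an asterisk in the first byte as the deletion mark; `RECORD_SIZE` and the sample names are hypothetical, and in-memory streams stand in for real files.

```python
import io

RECORD_SIZE = 16        # hypothetical fixed record length
DELETED = ord("*")      # deletion mark in the first byte of the key field

def compact(src, dst):
    """Copy every live record from src to dst; return the number kept."""
    kept = 0
    while True:
        record = src.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:   # end of file
            break
        if record[0] != DELETED:
            dst.write(record)
            kept += 1
    return kept

records = [b"ADAMS".ljust(RECORD_SIZE),
           b"*AKER".ljust(RECORD_SIZE),   # marked as deleted
           b"CLARK".ljust(RECORD_SIZE)]
old = io.BytesIO(b"".join(records))
new = io.BytesIO()
print(compact(old, new), len(new.getvalue()))   # 2 32
```

Because the whole file is rewritten, nothing else can safely use it until the pass finishes.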

Dynamic maintenance

• Delete records by marking them
• Reuse deleted records as new records are added or updated
• Need:
  » A way of knowing whether deleted records exist
  » To know where deleted records are, so we can jump right to them

Dynamic maintenance

• Solution: linked list of deleted records
  » Each deleted record contains a mark and a pointer to the next deleted record
  » The file header contains a pointer to the first deleted record.

Linked list of deleted records

• Fixed-length records
• Variable-length records

Linked list of deleted records

• Fixed-length records
  » Simply maintain a stack of deleted records rooted in the header record
  » Deletion: add to front of list
  » Addition: use record at front of list
  » Minimal list maintenance cost
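The stack of deleted fixed-length records might look like the sketch below. The layout is an assumption made for illustration: a 4-byte header holding the RRN of the first deleted record (-1 when there is none), and deleted slots that start with an asterisk followed by the RRN of the next deleted record.

```python
import io
import struct

RECORD_SIZE = 16               # hypothetical fixed record length
HEADER = struct.Struct("<i")   # RRN of first deleted record, -1 if none
LINK = struct.Struct("<ci")    # b"*" mark + RRN of next deleted record

def offset(rrn):
    return HEADER.size + rrn * RECORD_SIZE

def read_head(f):
    f.seek(0)
    return HEADER.unpack(f.read(HEADER.size))[0]

def write_head(f, rrn):
    f.seek(0)
    f.write(HEADER.pack(rrn))

def delete(f, rrn):
    """Push slot rrn onto the stack of deleted records."""
    head = read_head(f)
    f.seek(offset(rrn))
    f.write(LINK.pack(b"*", head).ljust(RECORD_SIZE))
    write_head(f, rrn)

def add(f, data):
    """Reuse the slot on top of the stack, or append if the stack is empty."""
    head = read_head(f)
    if head == -1:
        f.seek(0, io.SEEK_END)
        rrn = (f.tell() - HEADER.size) // RECORD_SIZE
    else:
        rrn = head
        f.seek(offset(rrn))
        _, next_deleted = LINK.unpack(f.read(LINK.size))
        write_head(f, next_deleted)     # pop the stack
    f.seek(offset(rrn))
    f.write(data.ljust(RECORD_SIZE))
    return rrn

f = io.BytesIO(HEADER.pack(-1))
for name in (b"ADAMS", b"BAKER", b"CLARK", b"DAVIS"):
    add(f, name)
delete(f, 1)
delete(f, 2)
reused = [add(f, b"EVANS"), add(f, b"FLINT"), add(f, b"GRANT")]
print(reused)   # [2, 1, 4]
```

Both operations touch only the header and one slot, which is why the list maintenance cost stays small.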

Linked list of deleted records

• Variable-length records
  » Store for each deleted record
    – Deletion marker
    – Link to next deleted record
    – Record size indicator

Variable-length records

• Insertion
  » Which deleted record?
• Deletion
  » Add records to list (stack?)
  » Where?

Variable-length records Insertion

• Select and use a deleted record
• Break up records
  » Pick a record
  » If the deleted record is bigger than needed, break it into two: a record to use and a new, smaller deleted record.
  » Put the smaller deleted record back in the list
• Leave empty space at end
  » Pick a record
  » If the deleted record is bigger than needed, just leave empty space at the end.

Variable-length records Fragmentation

• Recall fragmentation in fixed-length records
  » At the end of fields, if the fields are fixed length
  » At the end of records, if the fields are variable length
  » Called internal fragmentation
• Leaving space at the end of a variable-length record also leads to internal fragmentation.
• Breaking up variable-length records gets rid of fragmentation, right? Wrong!

Variable-length records Fragmentation

• As records get broken up, smaller and smaller pieces get left over.
• These pieces are external fragmentation.

Variable-length records Insertion strategy

• How do we pick the record to use?
• First fit
  » Use the first deleted record found in the list that is large enough
• Best fit
  » Use the deleted record closest in size
• Worst fit
  » Use the deleted record that is largest
  » No good when not breaking up records!
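The three strategies can be compared on a toy list of deleted records. This is a sketch under simplifying assumptions: the list is an in-memory Python list of hypothetical `(offset, size)` pairs, and the space a real file would need for the mark, link, and size fields of a leftover piece is ignored.

```python
def first_fit(avail, need):
    """First slot in list order that is large enough."""
    for slot in avail:
        if slot[1] >= need:
            return slot
    return None

def best_fit(avail, need):
    """Smallest slot that is still large enough."""
    fits = [s for s in avail if s[1] >= need]
    return min(fits, key=lambda s: s[1], default=None)

def worst_fit(avail, need):
    """Largest slot, so the leftover piece is as big as possible."""
    fits = [s for s in avail if s[1] >= need]
    return max(fits, key=lambda s: s[1], default=None)

def allocate(avail, need, choose):
    """Take need bytes from the chosen slot and return its offset, or None."""
    slot = choose(avail, need)
    if slot is None:
        return None
    avail.remove(slot)
    offset, size = slot
    if size > need:
        avail.append((offset + need, size - need))   # leftover stays free
    return offset

avail = [(0, 68), (100, 38), (200, 72), (300, 60)]
print(first_fit(avail, 55), best_fit(avail, 55), worst_fit(avail, 55))
# (0, 68) (300, 60) (200, 72)
```

Best fit leaves a 5-byte leftover here, which is the kind of small piece that turns into external fragmentation. Worst fit leaves 17 bytes, which is more likely to be reusable.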

Variable-length records Insertion

• How do we find the record with the desired size?
  » Search them ALL!
  » Keep the records in sorted order by record size
    – Increasing size facilitates best fit
    – Decreasing size facilitates worst fit (just pick the first in the list)
    – This increases deletion time!

Variable-length records Reducing fragmentation

• Merge adjacent free records
• How do we know if a newly deleted record is adjacent to a free record?
  » Search the deleted list
  » Keep deleted records sorted by position in file
    – This makes finding adjacent free space trivial
    – Costs more at deletion time
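Merging is straightforward once the list is sorted by position. The sketch keeps a hypothetical in-memory list of `(offset, size)` pairs ordered by offset; a real file would store the links inside the deleted records themselves.

```python
import bisect

def free(avail, offset, size):
    """Insert a freed block into avail (sorted by offset), merging neighbours."""
    i = bisect.bisect_left(avail, (offset, size))
    # Merge with the following block if it starts where this one ends.
    if i < len(avail) and offset + size == avail[i][0]:
        size += avail[i][1]
        del avail[i]
    # Merge with the preceding block if it ends where this one starts.
    if i > 0 and avail[i - 1][0] + avail[i - 1][1] == offset:
        offset, size = avail[i - 1][0], avail[i - 1][1] + size
        i -= 1
        del avail[i]
    avail.insert(i, (offset, size))

avail = [(0, 40), (100, 30)]
free(avail, 40, 60)      # fills the gap between the two free blocks
print(avail)             # [(0, 130)]
```

Only the two neighbours in the sorted list need checking, but every deletion now pays for finding the right position.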

Fast Searching

• Binary searching
  » O(log n), where n is the number of records
  » Requires that the file be sorted
• Question: how do we sort the file?
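A sketch of binary search over a sorted file of fixed-length records follows. The record layout (`RECORD_SIZE`, `KEY_SIZE`, space-padded keys at the start of each record) is assumed for illustration, and each probe costs one seek and one read.

```python
import io

RECORD_SIZE = 16   # hypothetical fixed record length
KEY_SIZE = 8       # hypothetical key width at the start of each record

def binary_search(f, key, count):
    """Return the RRN of the record with this key, or -1 if absent."""
    low, high = 0, count - 1
    while low <= high:
        mid = (low + high) // 2
        f.seek(mid * RECORD_SIZE)           # one disk access per probe
        found = f.read(RECORD_SIZE)[:KEY_SIZE].rstrip()
        if found == key:
            return mid
        if found < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1

names = [b"ADAMS", b"BAKER", b"CLARK", b"DAVIS", b"EVANS"]   # already sorted
f = io.BytesIO(b"".join(n.ljust(RECORD_SIZE) for n in names))
print(binary_search(f, b"DAVIS", len(names)),
      binary_search(f, b"ZORRO", len(names)))   # 3 -1
```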

File Sorting

• Sort in RAM
  » Read in the entire file, then sort
  » Called internal sorting
  » Limited by the size of memory

Binary Search - Problems

• Binary searching requires more than one or two accesses
  » Accesses are VERY expensive
  » Accesses are very random (much seek time)
  » 100,000 records require an average of 16.5 accesses
  » We would like to approach the speed of a direct lookup!
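The figure of 16.5 is consistent with a common approximation for the average number of probes, floor(log2 n) + 1/2. That reading is an assumption, since the slide does not show a formula. The arithmetic is:

```python
import math

n = 100_000
levels = math.floor(math.log2(n))   # 16, since 2**16 <= n < 2**17
worst_case = levels + 1             # at most 17 probes
approx_average = levels + 0.5       # 16.5, the figure quoted above
print(worst_case, approx_average)   # 17 16.5
```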

Binary Search - Problems

• Keeping a file sorted is expensive
  » Every record added must be entered in sorted order
  » Reordering is costly
• Internal sorting is limited to small files
  » We will see there are sort methods to sort a file that will not fit in memory. But it is still expensive!

Keysorting

• Rather than sorting the file, we could sort an array of primary keys, where each key is accompanied by the address of the associated record.
• The pointer could be a byte offset from the start of the file or (if records are fixed length) an RRN.
• After sorting the keys, the file can be rewritten in order.
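Keysorting can be sketched in two passes over a file of fixed-length records. The layout constants and sample names are hypothetical; only the (key, RRN) pairs are held and sorted in memory, and the second pass seeks to each record in key order.

```python
import io

RECORD_SIZE = 16   # hypothetical fixed record length
KEY_SIZE = 8       # hypothetical key width at the start of each record

def keysort(src, dst):
    """Write the records of src to dst in key order; return the sorted keys."""
    # Pass 1: collect (key, RRN) pairs; whole records are not kept.
    keys = []
    src.seek(0)
    rrn = 0
    while True:
        record = src.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break
        keys.append((record[:KEY_SIZE], rrn))
        rrn += 1
    keys.sort()                       # swaps small pairs, not full records
    # Pass 2: one seek per record, in key order.
    for _, rrn in keys:
        src.seek(rrn * RECORD_SIZE)
        dst.write(src.read(RECORD_SIZE))
    return keys

names = [b"DAVIS", b"ADAMS", b"CLARK", b"BAKER"]
src = io.BytesIO(b"".join(n.ljust(RECORD_SIZE) for n in names))
dst = io.BytesIO()
order = [rrn for _, rrn in keysort(src, dst)]
print(order)   # [1, 3, 2, 0]
```

The second pass reads records in an order unrelated to their position in the file, so each read is effectively a random access.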

Keysorting

• Advantages
  » Keys can be sorted in a smaller space than the whole file
  » Faster to sort (swap!) keys than entire records

Keysorting

• Disadvantages
  » Still limited in size to key lists which fit in memory
  » Sequential processing cannot take advantage of buffering!

Keysorting

• Alternative: keep the sorted (key, pointer) structure around.
• It is a type of index file!
• It can be read in and searched in memory!

Key Sorted Index

• Advantages
  » Keys and pointers can be searched in memory. Only one I/O per lookup!
  » File can be maintained in ANY order. Searching and key-order sequential processing are still possible.
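A key-sorted index might be used as in the sketch below. It assumes fixed-length records and rebuilds the index from the data file for brevity, where a real system would load it from a separate index file. The helper names are illustrative.

```python
import bisect
import io

RECORD_SIZE = 16   # hypothetical fixed record length
KEY_SIZE = 8       # hypothetical key width at the start of each record

def build_index(f):
    """Read the data file once and return a sorted list of (key, RRN)."""
    index = []
    f.seek(0)
    rrn = 0
    while True:
        record = f.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break
        index.append((record[:KEY_SIZE].rstrip(), rrn))
        rrn += 1
    index.sort()
    return index

def lookup(f, index, key):
    """Binary-search the in-memory index, then do a single seek and read."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    f.seek(index[i][1] * RECORD_SIZE)
    return f.read(RECORD_SIZE)

names = [b"DAVIS", b"ADAMS", b"CLARK", b"BAKER"]   # data file in arrival order
data = io.BytesIO(b"".join(n.ljust(RECORD_SIZE) for n in names))
index = build_index(data)
print(lookup(data, index, b"CLARK"))
```

The data file stays in arrival order. Walking `index` from start to end gives key-order sequential access, at the cost of one seek per record.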

Key Sorted Index

• Disadvantages
  » Sequential processing cannot take advantage of buffering!
  » Pinned records
    – Records in the main file cannot change location without invalidating the index file!
    – Must either maintain the index in parallel, or rebuild it!