External Sorting

Cosequential Processing
Chapter 8
File Processing - Cosequential Processing
MVNC
1
Cosequential Processing

Coordinated processing of two or more sequential lists
Goals
  » To merge lists into a single sorted list (union)
      – Make a single sorted list from many
  » To match records with the same keys (intersection)
      – Apply transactions to a master file
      – Find entries which exist in multiple lists
Cosequential Processing

Keys
  » Matching/merging may be by a single key or several.
  » Number of keys only affects compare operator, not sort strategy
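As a small illustration of the point about keys, the Python sketch below merges the same two lists once by a single key and once by two keys. The record layout (last name, first name, amount), the sample data, and the helper names are assumptions invented for this example. Only the key function changes; the merge call itself is the same.

```python
import heapq

# Hypothetical records: (last_name, first_name, amount), invented for illustration.
# Each list is already sorted by (last_name, first_name).
list1 = [("Gray", "Zed", 10), ("Smith", "Bob", 20)]
list2 = [("Gray", "Ann", 30), ("Jones", "Cal", 40)]

def single_key(record):
    # Compare on last name only.
    return record[0]

def composite_key(record):
    # Compare on last name, then first name (tuples compare element by element).
    return (record[0], record[1])

# The merge strategy is identical; only the comparison key differs.
merged_single = list(heapq.merge(list1, list2, key=single_key))
merged_composite = list(heapq.merge(list1, list2, key=composite_key))
```

With the composite key, the two "Gray" records come out ordered by first name as well.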
Master Transaction File Processing

Common processing strategy on sequential files.
Common since historically sequential processing was the rule (tapes, cards)
Companies stored data in sequential files
Lists of “transactions” are posted against these records periodically.
Master Transaction File Processing

Consider a grocery store
  » Record of inventory for each type of item stored in a large sequential file (master file)
  » As items are sold, the item number and quantity sold are posted (written) as records to a transaction file
  » As trucks deliver new items, item numbers and quantities are entered into the transaction file.
  » As new types of items are added to inventory, or old items are discontinued, entries about this are placed in the transaction file.
Master Transaction File Processing

grocery store example:

Master File
Item #  Item Name        Type  Quan
20231   Shoe Shine (br)  6     4
20231   Shoe Shine (bl)  6     1
20177   Cottage Cheese   5     392
20179   Chicken Soup     6     32
20231   T-bone           2     43
....

Transaction File (U - Update, A - Add, D - Delete)
Item #  Trans  Quan  Item Name
20231   U      -2
20231   U      50
20379   U      -5
20443   U      -4
20445   A      40    Corn Chips
20532   A      300   Butter
20534   D
20558   U      200
....
Master Transaction File Processing

Periodically update master from transaction
[Diagram: Old Master File + Transaction File -> Update Operation -> New Master File + Update Messages]
Master Transaction File Processing

Transactions are applied against master.
New master is created
Invalid transactions result in a message
Important changes are recorded in messages - audit trail
Transaction and master must be in sorted order.
Master Transaction File Processing

Processing Scheme
Read record Mast from old Master and Trans from Transaction
While more records in both files
    If Trans.ID < Mast.ID then
        If Add then write Trans as a new record to new master
        else transaction error (no matching master record)
        Read next Trans from transaction
    else If Trans.ID = Mast.ID then
        If Update then update Mast (hold it; more transactions may apply)
        else If Delete then read next Mast from old master (no write)
        else transaction error (Add of an existing record)
        Read next Trans from transaction
    else (Trans.ID > Mast.ID)
        write Mast to new master
        Read next Mast from old master
If more records in old master, write them to new master
If more records in transaction, write Adds to new master, give errors for the rest
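The processing scheme can be sketched in Python roughly as follows. This is a minimal illustration under assumptions, not the course's implementation: records are plain tuples, the function name `update_master` and the sample data are invented here, and an Update that follows an Add of the same new item in one batch is reported as an error to keep the sketch short.

```python
def update_master(old_master, transactions):
    """Apply sorted transactions to a sorted master list.

    old_master:   list of (item_id, name, quantity), sorted by item_id
    transactions: list of (item_id, code, quantity, name), sorted by item_id,
                  where code is "U" (update), "A" (add) or "D" (delete)
    Returns (new_master, messages).
    """
    new_master, messages = [], []
    masters = iter(old_master)
    mast = next(masters, None)
    for item_id, code, quantity, name in transactions:
        # Copy master records that come before this transaction's key.
        while mast is not None and mast[0] < item_id:
            new_master.append(mast)
            mast = next(masters, None)
        matched = mast is not None and mast[0] == item_id
        if code == "A" and not matched:
            new_master.append((item_id, name, quantity))
        elif code == "U" and matched:
            # Hold the updated record; later transactions may also apply to it.
            mast = (item_id, mast[1], mast[2] + quantity)
        elif code == "D" and matched:
            messages.append(f"{item_id}: deleted")   # audit trail entry
            mast = next(masters, None)               # dropped: never written
        else:
            messages.append(f"{item_id}: invalid transaction {code}")
    # Transactions are exhausted: copy the rest of the old master.
    while mast is not None:
        new_master.append(mast)
        mast = next(masters, None)
    return new_master, messages

# Hypothetical data, invented for illustration (both lists sorted by item_id).
old_master = [(100, "Soup", 32), (200, "Cheese", 392), (300, "Butter", 10)]
transactions = [(100, "U", -2, ""), (100, "U", 50, ""),
                (150, "A", 40, "Corn Chips"), (200, "D", 0, ""),
                (250, "U", 5, "")]
new_master, messages = update_master(old_master, transactions)
```

Both inputs are read strictly front to back, which is why both must already be sorted on the same key.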
Merging

Merge two (or more) sorted lists into a single sorted list
May remove duplicates (union) or keep them

List 1: Bill, Gray, Hillery, Jenny, Linda, Mary, Randy
List 2: Cathy, Fran, Kenny, Pete, Sally, Zeke
merge
Result: Bill, Cathy, Fran, Gray, Hillery, Jenny, Kenny, Linda, Mary, Pete, Randy, Sally, Zeke
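Merging
/* Merge sorted List1[0..Max1] and List2[0..Max2] into Result.
   Max1 and Max2 are the last valid indexes; int elements are assumed here. */
void Merge(const int List1[], int Max1, const int List2[], int Max2, int Result[])
{
    int next1 = 0, next2 = 0, out = 0;

    /* Take the smaller head element while both lists have records left. */
    while (next1 <= Max1 && next2 <= Max2) {
        if (List1[next1] > List2[next2])
            Result[out++] = List2[next2++];
        else
            Result[out++] = List1[next1++];
    }
    /* Copy whatever remains in the list that was not exhausted. */
    while (next1 <= Max1)
        Result[out++] = List1[next1++];
    while (next2 <= Max2)
        Result[out++] = List2[next2++];
}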
Sorting

Small files
  » sort completely in memory
  » Called internal sorting.
Sorting

Larger files
  » may be too large to fit in memory simultaneously
  » require "external sorting"
  » Sorting using secondary devices
External Sorting

Criteria for evaluating external sorting algorithms
  » Different from internal sorts
Internal sort comparison criteria
  » Number of comparisons required
  » Number of swaps made
  » Memory needs
External sort comparison criteria
  » Dominated by I/O time
  » Minimize transfers between secondary storage and main memory
External Sorting

Two major external sorting methods
  » in situ - sort the file in place
  » use additional storage space
External Sorting

Characteristics of in situ sorting
  » uses less file space, thus larger files may be sorted.
  » if a crash occurs during the sort, the file may be left in a corrupt state
  » in situ sorts may be done on direct-access files using standard internal-type sorts.
  » direct-access required (may not be available)
  » performance of such algorithms tends to be data sensitive
External Sorting

Consider a file with 1000 records, 120 bytes each
We have 25,000 bytes available for a buffer.
Solution?
  » read in 200 records at a time, sort internally
  » This results in 5 sorted files
  » merge the resulting sorted files into 1 sorted file
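The arithmetic behind these figures can be checked with a few lines of Python. The variable names are chosen here for illustration; the numbers come from the slide. The buffer could hold slightly more than 200 records, so 200 appears to be a convenient round figure.

```python
import math

record_count = 1000      # records in the file (from the slide)
record_bytes = 120       # bytes per record (from the slide)
buffer_bytes = 25_000    # memory available for the buffer (from the slide)

# Largest number of whole records that fit in the buffer.
max_records_in_buffer = buffer_bytes // record_bytes

# The slide reads 200 records at a time.
records_per_partition = 200
partition_count = math.ceil(record_count / records_per_partition)
```

Here `max_records_in_buffer` is 208 and `partition_count` is 5, matching the five sorted files on the slide.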
Sort/Merge

A common non-in situ method is an algorithm called "sort-merge"
"safe" sorting technique
performance is guaranteed
requires only serial file access
Sort/Merge
[Diagram: Partition -> Sort (four sort boxes) -> Merge]
Sort/Merge

Sort/Merge techniques have two stages:
  » sort stage - sorted partitions are generated
      – Size depends on available memory
  » merge stage - sorted partitions are merged (repetitively if necessary)
      – Why might more than one merge phase be needed?
Basic Sort/Merge

initial partition size is 1
  » Merge begins immediately (no sort)
  » Smallest main memory use
  » requires only 2 buffers in memory.
File starts with N "sorted" files of size 1
Similar to internal merge/sort
Improving Sort/Merge

Increase buffer size
  » Partitions sorted (in memory) with little I/O
  » Larger partitions mean fewer (I/O intensive) merges needed
  » Take advantage of already sorted runs of data
  » Consider the "unsortedness" of the data
Sort/Merge

Producing sorted partitions
  » internal sorting
  » natural selection - (use already sorted runs)
  » replacement selection
Internal sorting

read M records (M determined by available memory)
sort them using internal sorting techniques
write back out, creating a partition of size M
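The three steps above can be sketched as follows. This is a minimal illustration that keeps the partitions in Python lists rather than writing them to files; the function name `make_sorted_partitions` and the sample keys are assumptions made for this example.

```python
from itertools import islice

def make_sorted_partitions(records, m):
    """Split records into sorted partitions of at most m records each."""
    source = iter(records)
    partitions = []
    while True:
        chunk = list(islice(source, m))   # read (up to) M records
        if not chunk:
            break
        chunk.sort()                      # sort them with an internal sort
        partitions.append(chunk)          # "write" the partition back out
    return partitions

# Hypothetical keys, invented for illustration.
partitions = make_sorted_partitions([5, 3, 8, 1, 9, 2, 7], 3)
```

Every partition has exactly M records except possibly the last one, regardless of how ordered the input already is.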
Sort/Merge

Replacement selection (snowshovel)
  » files usually not totally out of order
  » take advantage of partial ordering in file
  » partition size varies with already existing ordering
Replacement selection (snowshovel)

Start with primary buffer of size N (snowshovel)
1. Read in N records into buffer
2. Output record with smallest key
3. Replace with next record in file
4. if this new record is smaller than the last record written, "freeze" it (must wait for next partition)
5. if unfrozen records remain, go to 2
6. If all records frozen, unfreeze them all, start new partition, go to 2
Replacement selection (snowshovel)

if file is sorted or almost sorted, one pass may suffice for complete sort!
average partition length is 2N
Consider file with N = 4:
  » 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 [EOF]
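One way to work this example is to run the replacement selection steps in code. The sketch below is an illustration under assumptions: the unfrozen records are kept in a heap so the smallest key is easy to find, partitions are returned as lists instead of being written to files, and the names `replacement_selection` and `_EOF` are invented here.

```python
import heapq
from itertools import islice

_EOF = object()  # sentinel marking the end of the input

def replacement_selection(records, n):
    """Return the sorted partitions produced with a buffer of n records."""
    source = iter(records)
    heap = list(islice(source, n))        # step 1: read N records
    heapq.heapify(heap)
    frozen, partition, partitions = [], [], []
    while heap:
        smallest = heapq.heappop(heap)    # step 2: output the smallest key
        partition.append(smallest)
        record = next(source, _EOF)       # step 3: replace it from the file
        if record is not _EOF:
            if record < smallest:
                frozen.append(record)     # step 4: must wait for next partition
            else:
                heapq.heappush(heap, record)
        if not heap:                      # step 6: only frozen records remain
            partitions.append(partition)
            partition = []
            heap, frozen = frozen, []
            heapq.heapify(heap)
    return partitions

keys = [29, 42, 3, 7, 9, 101, 99, 87, 89, 100, 16, 8, 12, 2, 15]
partitions = replacement_selection(keys, 4)
```

For this input the sketch produces two partitions, `3 7 9 29 42 87 89 99 100 101` and `2 8 12 15 16`. The first holds 10 records even though the buffer holds only 4.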
Natural Selection

Frozen records in the replacement scheme take up space and search time.
Natural selection, rather than freezing, writes these unused records to a fixed length secondary file (called the reservoir)
partition creation terminates when the reservoir is full.
Next, the buffer is refilled first with records from the reservoir, then with records from the file (if more are needed)
expected partition length is 2.718N if reservoir and buffer are the same size (about 30)
Natural Selection

Redo example with reservoir size 4
  » 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 [EOF]
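The example can be checked with a short sketch of natural selection. It reflects one reading of the slide's description and rests on several assumptions: the reservoir is no larger than the buffer, the records still in the buffer are written out to finish the partition once the reservoir is full, and the reservoir is held in a list rather than a secondary file. The names `natural_selection` and `_EOF` are invented here.

```python
import heapq
from itertools import islice

_EOF = object()  # sentinel marking the end of the input

def natural_selection(records, buffer_size, reservoir_size):
    """Return the sorted partitions produced by natural selection."""
    if not 0 < reservoir_size <= buffer_size:
        raise ValueError("this sketch assumes 0 < reservoir_size <= buffer_size")
    source = iter(records)
    heap = list(islice(source, buffer_size))
    heapq.heapify(heap)
    partitions = []
    while heap:
        partition, reservoir = [], []
        while heap:
            smallest = heapq.heappop(heap)
            partition.append(smallest)
            # Refill the freed slot; too-small records go to the reservoir.
            # Once the reservoir is full, stop reading and drain the buffer.
            while len(reservoir) < reservoir_size:
                record = next(source, _EOF)
                if record is _EOF:
                    break
                if record >= smallest:
                    heapq.heappush(heap, record)
                    break
                reservoir.append(record)
        partitions.append(partition)
        # Refill the buffer from the reservoir first, then from the file.
        heap = reservoir + list(islice(source, buffer_size - len(reservoir)))
        heapq.heapify(heap)
    return partitions

keys = [29, 42, 3, 7, 9, 101, 99, 87, 89, 100, 16, 8, 12, 2, 15]
partitions = natural_selection(keys, 4, 4)
```

On this short input the sketch yields the partitions `3 7 9 29 42 87 89 99 100 101` and `2 8 12 15 16`, the same as replacement selection would give. The buffer never has to hold records that are waiting for the next partition.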
Distribution and Merging

Merging
  » required to bring the sorted partitions together into a sorted whole
  » may require a series of merge “phases”, where shorter partitions are merged into larger partitions
      – More than one partition per file
      – Not all partitions can be opened at once
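A rough sketch of merging in phases is shown below. It assumes at most `max_open` partitions can be read at once and keeps the partitions in memory as lists; the function name `merge_in_phases`, the parameter `max_open`, and the sample data are assumptions made for this illustration. The standard library's `heapq.merge` does the k-way merge within each phase.

```python
import heapq

def merge_in_phases(partitions, max_open):
    """Merge sorted partitions, at most max_open at a time.

    Returns (merged_list, number_of_phases).
    """
    if max_open < 2:
        raise ValueError("need at least 2 partitions open to merge")
    phases = 0
    while len(partitions) > 1:
        # One phase: merge each group of up to max_open partitions.
        partitions = [
            list(heapq.merge(*partitions[i:i + max_open]))
            for i in range(0, len(partitions), max_open)
        ]
        phases += 1
    return (partitions[0] if partitions else []), phases

# Hypothetical sorted partitions, invented for illustration.
merged, phases = merge_in_phases([[1, 4], [2, 9], [0, 3], [5, 6], [7, 8]], 2)
```

With five partitions and only two open at a time, three phases are needed (5, then 3, then 2, then 1 partition). With `max_open=5` a single phase would suffice.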
Merging
Single phase
Merging
Multiple phase
Merging
Multiple Partitions / File

P1 P3 P5 P7 P9 P11    |  P2 P4 P6 P8 P10 P12
P1-2 P5-6 P9-10       |  P3-4 P7-8 P11-12
P1-4 P9-12            |  P5-8
P1-8                  |  P9-12
P1-12
Merging

Major issues - minimizing overall I/O
  » Different length partitions
      – Spend time simply reading and writing from one file
  » Left over partitions
      – Spend time simply copying partitions
Distribution and Merging

Distribution
  » In order to merge, partitions must be “distributed” to files in a manner facilitating the merge process.
  » If 1 partition per file, distribution is trivial
  » If >1 partition per file, distribution should minimize I/O
  » Several partitions may be placed in each file
Balanced N-way merge

use as many files (or tapes) as the system can open at once
Distribute the partitions evenly among F/2 files
repetitively merge back and forth between one set of F/2 files and the other
Distribute the generated partitions evenly among the F/2 output files
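The number of merge phases this scheme needs can be estimated with a short sketch. It assumes each phase merges one partition from each of the F/2 input files at a time, so the partition count shrinks by a factor of F/2 (rounded up) per phase. The function name `balanced_merge_phases` is invented for this illustration.

```python
import math

def balanced_merge_phases(partition_count, file_count):
    """Return the partition count before and after each merge phase."""
    ways = file_count // 2          # F/2 input files, F/2 output files
    if ways < 2:
        raise ValueError("need at least 4 files for a balanced merge")
    counts = [partition_count]
    while counts[-1] > 1:
        counts.append(math.ceil(counts[-1] / ways))
    return counts

twelve = balanced_merge_phases(12, 4)   # 12 partitions on 4 files
seven = balanced_merge_phases(7, 4)     # 7 partitions on 4 files
```

Twelve partitions on four files go 12, 6, 3, 2, 1 (four phases), and seven partitions go 7, 4, 2, 1 (three phases). When the count is odd, the left-over partition is simply copied in that phase.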
Balanced 2-way merge

File 1: P1 P3 P5 P7 P9 P11
File 2: P2 P4 P6 P8 P10 P12

File 3: P1-2 P5-6 P9-10
File 4: P3-4 P7-8 P11-12

File 1: P1-4 P9-12
File 2: P5-8

File 3: P1-8
File 4: P9-12

File 1: P1-12
Balanced 2-way merge

Example: 4 files, 700 records, 100 primary records can be sorted in memory

1-100 201-300 401-500 601-700   |  101-200 301-400 501-600
1-200 401-600                   |  201-400 601-700
1-400                           |  401-700
1-700
Balanced N-way merge

advantage
  » simple
disadvantage
  » wastes time if partition sizes differ
  » spends time reading and writing records without actually merging
Polyphase merging

Strategically distribute the partitions onto F files based on the Fibonacci Sequence
Algorithm
  » During each phase merge the F smallest files until the end of one file is reached.
After each phase at least one file will now be empty - this file becomes available as the new place to merge into
Continue to merge until only one file exists
Polyphase merging

Consider: Initially generate three files:
  » 24 partitions, 20 partitions, and 13 partitions
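The phases for this distribution can be traced with a small simulation. It tracks only how many partitions each file holds, and it assumes the three files are merged into a fourth, initially empty file. The slide does not say how many files are in use, so that fourth file and the function name `polyphase_phases` are assumptions made for this sketch.

```python
def polyphase_phases(partition_counts):
    """Return the partitions-per-file tuple before and after each phase."""
    files = list(partition_counts) + [0]   # assumed: one empty output file
    history = [tuple(files)]
    while sum(1 for count in files if count > 0) > 1:
        out = files.index(0)               # an empty file receives the output
        inputs = [i for i, count in enumerate(files) if count > 0]
        # Merge until the smallest input file runs out of partitions.
        step = min(files[i] for i in inputs)
        for i in inputs:
            files[i] -= step
        files[out] += step                 # each merge yields one longer partition
        history.append(tuple(files))
    return history

history = polyphase_phases([24, 20, 13])
```

Under these assumptions the counts go (24, 20, 13, 0), (11, 7, 0, 13), (4, 0, 7, 6), (0, 4, 3, 2), (2, 2, 1, 0), (1, 1, 0, 1), (0, 0, 1, 0). Exactly one file empties in each phase, and a single partition remains after six phases.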
Polyphase merging

advantages
  » No overhead from merging partitions of different sizes
disadvantages
  » complex management of files
  » must know partition sizes
  » still not completely optimal - partition sizes not always maximal.