Structures to Manage External Storage
Preliminaries
• Multiway trees have nodes that can hold more than two children. A multiway tree of order k has nodes with at most k children.
• 2-3-4 Trees
  – For all non-leaf nodes, nodes with
    • One data item have two pointers
    • Two data items have three pointers
    • Three data items have four pointers
  – Children of pointer p have keys less than data item p.
  – Children of the last pointer contain keys greater than the last data item.
• B-Trees (Balanced, Boeing, broad, bushy, or Bayer (for Rudolf Bayer)??)
  – Each node contains links to as many children as can fit in a disk block.
Node Structures
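• 2-3-4 tree
typedef struct Nodelink {
    int numElems;               /* number of items in use (1 to 3) */
    Item *items[3];
    struct Nodelink *links[4];  /* child pointers; all NULL in a leaf */
} Node;
• B-Tree
/* K = number of items that fit in one disk block */
typedef struct Nodelink {
    Item items[K];
    struct Nodelink *nodes[K + 1];
} Node;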
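The pointer rules from the Preliminaries can be sketched as a search over the 2-3-4 node structure. This is a minimal illustration, not a reference implementation; it assumes Item is a plain int and that every leaf has all links set to NULL. The name search234 is made up for this example.

```c
#include <stddef.h>

typedef int Item;                 /* assumption: keys are plain ints */

typedef struct Nodelink {
    int numElems;                 /* 1 to 3 items in use */
    Item *items[3];
    struct Nodelink *links[4];    /* all NULL in a leaf */
} Node;

/* Return a pointer to the stored item equal to key, or NULL if absent. */
Item *search234(const Node *node, Item key)
{
    while (node != NULL) {
        int p = 0;
        /* Skip the items smaller than key; p ends at the pointer to follow. */
        while (p < node->numElems && key > *node->items[p])
            p++;
        if (p < node->numElems && key == *node->items[p])
            return node->items[p];
        /* Keys under links[p] are less than items[p]; the last link
           holds keys greater than the last item. */
        node = node->links[p];
    }
    return NULL;
}
```

Each loop iteration touches one node, so the number of nodes visited equals the depth of the tree.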
2-3-4 Insertion Algorithm
Insert( node )
  If node is full Then
    Call splitNode and resume the search from the parent (or the new root)
  If key is found in node Then
    Return “DuplicatesNotAllowed”
  If this is a leaf node Then
    Insert the data item and Return
  Call Insert( appropriateChildPointer )

SplitNode
  Allocate a newNode and move the right item (with its two child pointers) to it
  If parent exists Then
    Insert the middle item into the parent node, with the pointer to its right pointing to newNode
  Else
    Allocate new Root containing the middle item
    root’s firstChildPointer points to node
    root’s secondChildPointer points to newNode
  The left item stays in node
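The top-down insertion above could look roughly like this in C. It is a sketch under stated assumptions: items are stored as plain ints rather than Item pointers, a full child is split just before descending into it (so the parent always has room), and the names Node234, splitChild, and insert234 are made up for this example.

```c
#include <stdlib.h>

typedef struct Node234 {
    int numElems;                  /* items in use (0 only for an empty root) */
    int items[3];                  /* assumption: int keys stored by value */
    struct Node234 *links[4];      /* all NULL in a leaf */
} Node234;

static Node234 *newNode234(void)
{
    Node234 *n = calloc(1, sizeof *n);
    if (n == NULL) abort();        /* sketch: give up on allocation failure */
    return n;
}

/* Split the full child parent->links[pos]; parent must have room. */
static void splitChild(Node234 *parent, int pos)
{
    Node234 *node = parent->links[pos];
    Node234 *right = newNode234();

    /* The right item and its two child pointers move to the new node. */
    right->numElems = 1;
    right->items[0] = node->items[2];
    right->links[0] = node->links[2];
    right->links[1] = node->links[3];

    /* Make room in the parent, then promote the middle item. */
    for (int i = parent->numElems; i > pos; i--) {
        parent->items[i] = parent->items[i - 1];
        parent->links[i + 1] = parent->links[i];
    }
    parent->items[pos] = node->items[1];
    parent->links[pos + 1] = right;
    parent->numElems++;

    /* The left item stays behind in the old node. */
    node->numElems = 1;
    node->links[2] = node->links[3] = NULL;
}

/* Insert key. Returns 1 on success, 0 for a duplicate. *root may change. */
int insert234(Node234 **root, int key)
{
    if (*root == NULL)
        *root = newNode234();
    if ((*root)->numElems == 3) {          /* full root: tree grows upward */
        Node234 *newRoot = newNode234();
        newRoot->links[0] = *root;
        splitChild(newRoot, 0);
        *root = newRoot;
    }
    Node234 *node = *root;
    for (;;) {
        int p = 0;
        while (p < node->numElems && key > node->items[p])
            p++;
        if (p < node->numElems && key == node->items[p])
            return 0;                      /* duplicates not allowed */
        if (node->links[0] == NULL) {      /* leaf: shift and insert */
            for (int i = node->numElems; i > p; i--)
                node->items[i] = node->items[i - 1];
            node->items[p] = key;
            node->numElems++;
            return 1;
        }
        if (node->links[p]->numElems == 3) {
            splitChild(node, p);
            continue;                      /* re-scan: promoted item may change p */
        }
        node = node->links[p];
    }
}
```

Starting from `Node234 *root = NULL;`, repeated calls to `insert234(&root, key)` build the tree. Because full nodes are split on the way down, no split ever has to propagate back up.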
2-3-4 Deletion Algorithm
• Find the item to delete. If it is not in a leaf node, replace it with its successor, and then remove the successor from its leaf.
• Cases to consider when deleting an item from a 2-3-4 node:
1. If the leaf node that contains the item to delete holds more than one item, simply remove it.
2. If the item to delete is the only one in the node:
a. If there is a sibling with more than one entry, promote an item from the sibling and demote the parent item (possibly cascading) until the node has a spare entry. Then delete the item in question.
b. If all sibling nodes have only one entry, demote the parent item, merge it with the sibling, and then delete the current node. If the parent node is now empty, recursively traverse up the tree applying the above steps as needed. If the root node becomes empty, simply remove it from the tree.
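The two repair steps for a one-item node (borrow from a sibling, or merge with it) can be sketched as follows. This is one common formulation, not necessarily the exact procedure the slides had in mind: it assumes int keys stored by value, handles only the right-hand sibling, and merges before the item is removed rather than after. The names Node234, borrowFromRight, and mergeWithRight are hypothetical.

```c
#include <stdlib.h>

typedef struct Node234 {
    int numElems;
    int items[3];                  /* assumption: int keys stored by value */
    struct Node234 *links[4];      /* all NULL in a leaf */
} Node234;

/* Case 2a: parent->links[pos] has one item and its right sibling has a
   spare. Demote the separating parent item into the node and promote
   the sibling's smallest item into the parent. */
void borrowFromRight(Node234 *parent, int pos)
{
    Node234 *node = parent->links[pos];
    Node234 *sib = parent->links[pos + 1];

    node->items[node->numElems] = parent->items[pos];
    node->links[node->numElems + 1] = sib->links[0];
    node->numElems++;

    parent->items[pos] = sib->items[0];

    /* Close the gap left in the sibling. */
    for (int i = 1; i < sib->numElems; i++)
        sib->items[i - 1] = sib->items[i];
    for (int i = 1; i <= sib->numElems; i++)
        sib->links[i - 1] = sib->links[i];
    sib->links[sib->numElems] = NULL;
    sib->numElems--;
}

/* Case 2b: the node and its right sibling each hold one item. Pull the
   separating parent item down and merge all three into the node. The
   parent loses one item and may become empty (the caller handles that). */
void mergeWithRight(Node234 *parent, int pos)
{
    Node234 *node = parent->links[pos];
    Node234 *sib = parent->links[pos + 1];

    node->items[1] = parent->items[pos];
    node->items[2] = sib->items[0];
    node->links[2] = sib->links[0];
    node->links[3] = sib->links[1];
    node->numElems = 3;
    free(sib);                     /* assumes nodes were heap-allocated */

    for (int i = pos; i < parent->numElems - 1; i++) {
        parent->items[i] = parent->items[i + 1];
        parent->links[i + 1] = parent->links[i + 2];
    }
    parent->links[parent->numElems] = NULL;
    parent->numElems--;
}
```

After either step the node holds more than one item, so the target item can be removed as in case 1. If mergeWithRight leaves the root with zero items, the merged child becomes the new root.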
Visual Illustration of the 2-3-4-Delete
[Figure: example trees illustrating Case 1, Case 2, and Case 3]
The algorithm recursively works its way up the tree.
Characteristics of External Storage
• Access to external storage is at least three orders of magnitude slower than access to memory.
• The extra overhead of searching through multiway tree nodes is more than compensated for, because a shallower tree means fewer disk accesses.
• It is desirable to design the record sizes with disk block sizes in mind. Each disk read/write will be in multiples of the block size.
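The block-size point can be made concrete with a little arithmetic. The sketch below assumes records never straddle a block boundary and that sizes are nonzero; the function names and the sample sizes in the usage note are chosen only for illustration.

```c
#include <stddef.h>

/* Whole records that fit in one block (recordSize must be nonzero). */
size_t recordsPerBlock(size_t blockSize, size_t recordSize)
{
    return blockSize / recordSize;
}

/* Bytes left unused at the end of each block. */
size_t wastedPerBlock(size_t blockSize, size_t recordSize)
{
    return blockSize % recordSize;
}

/* Bytes actually transferred to read n bytes: rounded up to whole blocks
   (blockSize must be nonzero). */
size_t bytesTransferred(size_t n, size_t blockSize)
{
    return (n + blockSize - 1) / blockSize * blockSize;
}
```

With a 4096-byte block, a 100-byte record fits 40 times and wastes 96 bytes per block, while a 128-byte record fits exactly 32 times with no waste. Reading a single 100-byte record still transfers a full 4096-byte block.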
B-Tree Insertion Algorithm
• Differences from the 2-3-4 algorithm
  – Node splitting is from the bottom up rather than the top down.
    • Advantage: The tree is kept more full.
    • Disadvantage: A pass down the tree could be followed by a pass back up if multiple splits are necessary.
  – Half of the items go to the new node, half remain in the old node.
  – The middle key is promoted to the next level up.
  – Contraction occurs when a node and a sibling together hold less than a full block of data items.
Note: Standard B-tree implementations require nodes to be at least half full.
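The split rule (half stay, half move, middle key promoted) can be sketched for a single full leaf. This is an illustration only: MAX_KEYS is a hypothetical small capacity, child pointers are omitted, and the caller is assumed to insert the returned key into the parent, which may in turn split.

```c
#include <string.h>

#define MAX_KEYS 4                 /* hypothetical node capacity */

typedef struct {
    int numKeys;
    int keys[MAX_KEYS];            /* kept in ascending order */
} BLeaf;

/* Add key to a full leaf by splitting it. On return, `left` keeps the
   lower half, `right` receives the upper half, and the middle key is
   returned so the caller can promote it to the parent.
   Precondition: left->numKeys == MAX_KEYS and key is not already present. */
int splitLeaf(BLeaf *left, BLeaf *right, int key)
{
    int tmp[MAX_KEYS + 1];
    int i = 0, j = 0;

    /* Build the overflowing sorted sequence of MAX_KEYS + 1 keys. */
    while (i < MAX_KEYS && left->keys[i] < key)
        tmp[j++] = left->keys[i++];
    tmp[j++] = key;
    while (i < MAX_KEYS)
        tmp[j++] = left->keys[i++];

    int mid = (MAX_KEYS + 1) / 2;  /* index of the key to promote */

    left->numKeys = mid;
    memcpy(left->keys, tmp, (size_t)mid * sizeof tmp[0]);

    right->numKeys = MAX_KEYS - mid;
    memcpy(right->keys, tmp + mid + 1, (size_t)right->numKeys * sizeof tmp[0]);

    return tmp[mid];
}
```

With MAX_KEYS set to 4, both halves end up holding two keys, which matches the half-full requirement in the note above.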
External Storage Optimizations
• It is more efficient to keep the index and data separate
  – Separate indices allow for multi-keyed files
• Refinements exist to guarantee that no node is less than 2/3 full. Nodes are balanced over three siblings.
• Some implementations only have data pointers at the last level.
• A linked list of free disk blocks is often used to reclaim storage space after deletions.
• Efficiency: Assume a block contains 8096 bytes, each key is 24 bytes, the blocks are half full, and the pointers require 4 bytes. How many levels deep is the tree?
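The efficiency question can be worked through numerically. Each entry takes 24 + 4 = 28 bytes, so a full 8096-byte block holds 8096 / 28 = 289 entries and a half-full block about 144. The record count is not given in the question, so the sketch below takes it as a parameter. It also treats the fan-out as equal to the number of entries per block, which is a rough approximation. The function names are made up for this example.

```c
/* Entries (key plus pointer) that fit in one completely full block. */
unsigned entriesPerBlock(unsigned blockBytes, unsigned keyBytes, unsigned ptrBytes)
{
    return blockBytes / (keyBytes + ptrBytes);
}

/* Levels needed so that fanout^levels >= records.
   Returns -1 if fanout < 2, since the tree would never widen. */
int levelsNeeded(unsigned long long records, unsigned fanout)
{
    if (fanout < 2)
        return -1;
    int levels = 1;
    unsigned long long capacity = fanout;
    while (capacity < records) {
        capacity *= fanout;        /* one more level multiplies the reach */
        levels++;
    }
    return levels;
}
```

For example, `levelsNeeded(1000000ULL, entriesPerBlock(8096, 24, 4) / 2)` gives 3: under these assumptions one million records need about three levels, since 144 squared is 20,736 and 144 cubed is 2,985,984.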
Other External Storage Algorithms
• Create binary tree in memory for the index
• Sorting external data with a type of merge sort
  – On each pass
    • Read a large block from each piece of the file
    • Perform merge
    • Write back to second file
    • Keep reading blocks from each half until they run out.
  – There will be log_k N merges, where k is the number of data elements that can fit in the memory blocks.
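The merge step and the pass count can be sketched in memory. In a real external sort the runs would be read from disk a block at a time; here plain arrays stand in for the runs, and the pass count assumes a k-way merge that combines up to k sorted runs per step. The names mergeRuns and mergePasses are illustrative.

```c
#include <stddef.h>

/* Merge two sorted runs into out, which must hold na + nb elements. */
void mergeRuns(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (b[j] < a[i]) ? b[j++] : a[i++];   /* ties take from a */
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
}

/* Passes needed to reduce `runs` sorted runs to one when each pass
   merges groups of up to k runs. Returns -1 if k < 2. */
int mergePasses(size_t runs, size_t k)
{
    if (k < 2)
        return -1;
    int passes = 0;
    while (runs > 1) {
        runs = (runs + k - 1) / k;   /* each group of k runs becomes one */
        passes++;
    }
    return passes;
}
```

Eight runs merged two at a time need three passes, and every pass reads and rewrites the whole file once, which is why a larger k (more memory) pays off directly in fewer disk transfers.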