Sets of Digital Data
CSCI 2720
Fall 2005
Kraemer
Digital Data
In earlier work with BSTs and various balanced trees, we compared keys for order or equality
Here, we take advantage of the structure of the key:
Use it as an index, or
Decompose a string key into characters, or
Treat the key as a numerical quantity on which we can perform operations
Assumptions
We will construct and manipulate sets that are drawn from a universe U of size N, U = {u0, …, uN-1}
We assume a relatively simple procedure exists by which we can compute, for an element u ∈ U, the index i such that u = ui
Easy if U is a set of integers
Also easy if U is a set of characters with character codes in a contiguous interval
Bit Vector
Used to represent a subset S ⊆ U
A table of N bits, Bits[0..N-1]
Bits[i] == 1 if ui ∈ S
Bits[i] == 0 if ui ∉ S
Example: today’s attendance
[Bit vector indexed by student number 0 through 6: Bits[i] = 1 if student i is present, 0 if absent]
Bit Vectors
Assume:
determining an element’s index takes constant time
accessing a position in the table takes constant time
These may actually take several ops, and depend somewhat on N (the size of the universe), but not on the size of the set represented
Then:
Insert, Delete, Member are constant time ops
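As a concrete illustration, here is a minimal bit-vector sketch in C; the names (BitVector, bv_insert, and so on) and the fixed universe size are assumptions for the example, not from the slides. Each operation indexes one machine word and masks one bit, so Insert, Delete, and Member are constant time.

```c
#include <limits.h>
#include <string.h>

#define N 1024                                    /* size of the universe U (illustrative) */
#define WORD_BITS (sizeof(unsigned) * CHAR_BIT)   /* bits per machine word                 */
#define NWORDS ((N + WORD_BITS - 1) / WORD_BITS)  /* words needed to hold N bits           */

typedef struct { unsigned bits[NWORDS]; } BitVector;

/* Zero every bit: the set starts out empty. */
void bv_init(BitVector *s) { memset(s->bits, 0, sizeof s->bits); }

/* Insert, Delete, Member: constant time, one word access plus a mask. */
void bv_insert(BitVector *s, int i) { s->bits[i / WORD_BITS] |=  1u << (i % WORD_BITS); }
void bv_delete(BitVector *s, int i) { s->bits[i / WORD_BITS] &= ~(1u << (i % WORD_BITS)); }
int  bv_member(const BitVector *s, int i)
{
    return (s->bits[i / WORD_BITS] >> (i % WORD_BITS)) & 1u;
}
```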
Bit Vectors
A subset of a set of size N always takes N
bits to represent, independent of size of
subset
Makes sense if:
N is not too large
we need to represent sets of size comparable to N
Storage Efficiency
Bit Vector vs. Binary Trees
Binary tree, set of size n:
Requires n(2p + K) bits
K >= lg N, the size of the field needed to represent a key value
p = number of bits in a pointer
Bit vector: takes N bits
If n ≈ N, then the bit vector is more efficient
If p = K = 32, then the tree becomes more space efficient when n/N < 1%
Actually, when n(2p + K) = N, which is when n/N = 1/96
When to use Bit Vectors?
When universe is relatively small
When sets are large in relation to size of
universe
Advantages of Bit Vectors
O(1) implementation of Insert, Delete,
Member
Union and Intersection are easy
Implement via Boolean AND and OR operations
May actually take less than one op per element, as operations are performed on full machine words
If the machine word is 32 bits, then one machine operation handles 32 potential elements of the set
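A sketch of word-parallel Union and Intersection, reusing the hypothetical BitVector type from the sketch above: each | or & below handles WORD_BITS potential elements in a single machine operation.

```c
#include <stddef.h>

/* Union and intersection, a whole machine word at a time. */
void bv_union(BitVector *dst, const BitVector *a, const BitVector *b)
{
    for (size_t w = 0; w < NWORDS; w++)
        dst->bits[w] = a->bits[w] | b->bits[w];   /* 32+ elements per operation */
}

void bv_intersect(BitVector *dst, const BitVector *a, const BitVector *b)
{
    for (size_t w = 0; w < NWORDS; w++)
        dst->bits[w] = a->bits[w] & b->bits[w];
}
```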
Disadvantages of Bit Vectors
On some computers, access to individual bits can require shifting and masking operations (expensive)
The result is that Member may be much more expensive than Union
Initialization takes Θ(N) -- zero all the bits in the vector
But we can use a constant-time initialization algorithm (sketched below)
But that makes the storage requirement go to 2p + 1 bits per element
So, in practice, just use machine ops to set the bits to zero, which are efficient
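For reference, a hedged sketch of the usual constant-time-initialization trick the slide alludes to: alongside each data bit we keep two index arrays and a counter, and a slot counts as initialized only if the arrays point at each other below the counter. The two extra indices per element are what push the cost to roughly 2p + 1 bits per element; the names here are illustrative, and N is reused from the first sketch.

```c
/* from[], to[] and bit[] may start out as garbage (e.g., freshly
 * malloc'ed memory); only the counter `top` must be set to 0. */
static int from[N], to[N], top = 0;
static unsigned char bit[N];

/* Slot i is initialized iff from[i] and to[] point at each other below top. */
static int is_init(int i)
{
    return from[i] >= 0 && from[i] < top && to[from[i]] == i;
}

void lazy_set(int i, unsigned char v)
{
    if (!is_init(i)) {          /* first write: register slot i as initialized */
        from[i] = top;
        to[top++] = i;
    }
    bit[i] = v;
}

unsigned char lazy_get(int i)   /* unwritten slots read as 0, with no O(N) clearing pass */
{
    return is_init(i) ? bit[i] : 0;
}
```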
Tries and Digital Search Trees
If the key can be decomposed into characters, then the characters of the key can be used as indices
Tries are based on this idea
“trie” is the middle syllable of retrieval, a pun on tree, but pronounced “try”
Tries
Assume k possible character values
A trie is a (k+1)-ary tree
each node is a table of k+1 pointers
One pointer for each possible character
One for the end-of-string marker
Trie Example
Tries
The path for a key of m characters has length m, with a pointer at the end-of-string marker
Don’t need to store the key itself; it is the path followed
The info field might be pointed to by the end-of-string element
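A small trie sketch in C under assumed conventions (a lowercase alphabet of k = 26 characters plus one extra slot for the end-of-string marker, with the info field hanging off the end-marker node); the function names are illustrative.

```c
#include <stdlib.h>

#define K   26                     /* alphabet size                            */
#define END K                      /* extra slot for the end-of-string marker  */

typedef struct TrieNode {
    struct TrieNode *child[K + 1]; /* K character pointers + end marker        */
    void *info;                    /* payload, reached through the END slot    */
} TrieNode;

static TrieNode *trie_node(void) { return calloc(1, sizeof(TrieNode)); }

/* The key is never stored: it is the path followed from the root. */
void trie_insert(TrieNode *root, const char *key, void *info)
{
    for (; *key; key++) {
        int c = *key - 'a';
        if (!root->child[c]) root->child[c] = trie_node();
        root = root->child[c];
    }
    if (!root->child[END]) root->child[END] = trie_node();
    root->child[END]->info = info;
}

/* O(l) lookup for a key of l characters, independent of n, s and k. */
void *trie_lookup(const TrieNode *root, const char *key)
{
    for (; *key; key++) {
        int c = *key - 'a';
        if (!root->child[c]) return NULL;
        root = root->child[c];
    }
    return root->child[END] ? root->child[END]->info : NULL;
}
```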
Tries: Analysis
Let:
n be the number of keys stored in a trie
l be the length (in characters) of the longest key
s be the number of nodes in the trie
k be the size of the alphabet
Pro:
Access time is O(l), independent of k, n, and s
Con:
Size -- requires (k+1) * s * p bits
Most pointers are null, so lots of wasted space
Strategies for reducing storage requirements of tries
1. Implement a k-ary trie with m nodes as a 2-D, m-by-k table
[Table: rows are the numbered trie nodes (0, 1, 2, 3, 4, 5, …), columns are the characters A B C D E … M … P … T; each entry holds the index of the child node reached on that character, or '-' if there is no child]
Table approach
Number the nodes in the diagram of slide 13 from 1 to m
The table entry corresponding to the jth child of the ith node is the index of that child node
How does that save space? There are just as many nodes and entries as on slide 13
… but a node index needs only ceil(lg(m)) bits to represent, which is smaller than a pointer
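A sketch of the table representation, with made-up dimensions (the exact table from slide 13 is not reproduced here): entry table[i][c] holds the index of node i's child on character c, with 0 meaning no child, so each entry needs only ceil(lg m) bits (a 16-bit index below) rather than a full pointer.

```c
#define M     16                /* number of trie nodes, illustrative           */
#define ALPHA 27                /* 26 characters + an end-of-string column      */

/* table[i][c] = index of the child of node i on character c, or 0 if none. */
static unsigned short table[M][ALPHA];

/* Follow the key character by character from the root (node 1);
 * returns the index stored in the end-of-string column, or 0 if the key is absent. */
unsigned short table_lookup(const char *key)
{
    unsigned short node = 1;
    for (; *key && node; key++)
        node = table[node][*key - 'a'];
    return node ? table[node][ALPHA - 1] : 0;
}
```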
Patricia Tree: Another strategy for reducing space in a trie
Patricia tree
Practical Algorithm To Retrieve Information Coded In Alphanumeric
Eliminate nodes with only one nonempty child
Can now skip right from T to the end-of-string marker in TURING in our example
Skip from MA… to E or the end-of-string marker in the MENDEL, MENDELEEV chain
But need to store with each node the index of the character on which it discriminates
And need to store the key itself at the leaf
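A hedged sketch of how a Patricia-style search might look in C, reusing the assumed K and END constants from the trie sketch above: internal nodes store only the index of the discriminating character, and the full key stored at the leaf is compared at the end.

```c
#include <stddef.h>
#include <string.h>

typedef struct PatNode {
    int index;                     /* position of the character discriminated on */
    struct PatNode *child[K + 1];  /* one slot per character plus the end marker */
    const char *key;               /* full key, stored only at a leaf            */
    void *info;
} PatNode;

void *patricia_lookup(const PatNode *node, const char *key)
{
    size_t len = strlen(key);
    while (node && node->key == NULL) {   /* internal nodes: jump to the discriminating position */
        int c = ((size_t)node->index < len) ? key[node->index] - 'a' : END;
        node = node->child[c];
    }
    /* Characters were skipped on the way down, so verify against the stored key. */
    return (node && strcmp(node->key, key) == 0) ? node->info : NULL;
}
```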
Patricia tree
de la Briandais trees
Another strategy to save space vs. standard tries
Use a linked list instead of a table at the node level
Each pointer is labeled with the character it indexes
Longer search time than tries; depends on the size of the character set
Saves significant amounts of memory
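A de la Briandais sketch in C (illustrative names): each level is a linked list of labeled pointers rather than a k+1 entry table, so lookup scans a short list per character instead of indexing a table.

```c
typedef struct DlbNode {
    char ch;                  /* character this pointer is labeled with        */
    struct DlbNode *child;    /* first node of the list one level down         */
    struct DlbNode *sibling;  /* next alternative character at this level      */
    void *info;               /* payload, set on the end-of-string node ('\0') */
} DlbNode;

void *dlb_lookup(const DlbNode *node, const char *key)
{
    for (;;) {
        while (node && node->ch != *key)      /* linear scan of this level's list  */
            node = node->sibling;
        if (!node) return NULL;
        if (*key == '\0') return node->info;  /* '\0' plays the end-of-string role */
        key++;
        node = node->child;
    }
}
```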
de la Briandais
Another strategy …
Use tries at the first few levels
Use ordinary BSTs or de la Briandais trees at the lower levels
Reasoning:
speed advantage at the top, but not too much extra memory required
save space at the lower levels
Digital Search Trees
Treat keys as bit strings
(strings over the alphabet {0,1})
Binary tree – search directed left on 0, right on 1
Each node contains not only two pointers but also a key that matches that string prefix
Compare for equality before searching left or right
If frequencies are known, store higher-frequency keys nearer the root
Can be grown dynamically
Expected search time: O(log n)
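A digital search tree sketch in C for unsigned integer keys (an assumption for the example; the slides do not fix a key type): each node stores a full key, equality is checked first, and bit b of the key, taken from the most significant end, chooses the branch at depth b.

```c
#include <limits.h>
#include <stdlib.h>

typedef struct DstNode {
    unsigned key;
    struct DstNode *left, *right;  /* left on 0, right on 1 */
} DstNode;

DstNode *dst_search(DstNode *node, unsigned key)
{
    /* Depth never exceeds the key width, so the shift count stays in range. */
    for (int b = (int)(sizeof(unsigned) * CHAR_BIT) - 1; node; b--) {
        if (node->key == key) return node;             /* equality check first */
        node = ((key >> b) & 1u) ? node->right : node->left;
    }
    return NULL;
}

void dst_insert(DstNode **node, unsigned key)
{
    int b = (int)(sizeof(unsigned) * CHAR_BIT) - 1;
    while (*node) {
        if ((*node)->key == key) return;               /* already present      */
        node = ((key >> b) & 1u) ? &(*node)->right : &(*node)->left;
        b--;
    }
    *node = calloc(1, sizeof(DstNode));                /* grown dynamically    */
    (*node)->key = key;
}
```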
Digital Search Tree