Sets of Digital Data - University of Georgia
Sets of Digital Data
CSCI 2720
Fall 2005
Kraemer
Digital Data
In earlier work with BSTs and various
balanced trees, we compared keys for
order or equality
Here, we take advantage of the structure of
the key:
Use it as an index, or
Decompose string key into characters, or
Treat key as numerical quantity on which we
can perform operations
Assumptions
We will construct and manipulate sets that
Are drawn from a universe U of size N
U = {u_0, …, u_{N-1}}
A relatively simple procedure exists by
which we can compute, for an element u ∈ U,
the index i such that u = u_i.
Easy if U is set of integers
Also easy if U is set of characters with
character codes in a contiguous interval
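As a minimal sketch of the index computation for a character universe, assuming the universe is the contiguous range A..Z (the function name `char_index` and its defaults are illustrative, not from the slides):

```python
def char_index(ch, first="A", last="Z"):
    """Map a character to its index i in the universe {first, ..., last}."""
    # Assumes the character codes of the universe form a contiguous interval.
    if len(ch) != 1 or not (ord(first) <= ord(ch) <= ord(last)):
        raise ValueError(f"{ch!r} is not in the universe")
    return ord(ch) - ord(first)
```

The computation is a single subtraction, so it takes constant time regardless of how many elements are in the set.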
Bit Vector
Used to represent a subset S ⊆ U
A table of N bits, Bits[0..N-1]
Bits[i] == 1 if u_i ∈ S
Bits[i] == 0 if u_i ∉ S
Example: today’s attendance
Student number:  0  1  2  3  4  5  6
Bits:            1  1  0  1  0  1  1
1 = present, 0 = absent
Bit Vectors
Assume:
determining element index takes constant time
accessing position in table takes constant time
Each may actually take several ops, and depend
somewhat on N (the size of the universe), but not on
the size of the set represented
Then:
Insert, Delete, Member are constant time ops
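The three operations can be sketched as follows. This is one possible layout, assuming bits are packed eight per byte in a Python `bytearray`; the class and method names are illustrative.

```python
class BitVector:
    """Subset of a universe {0, ..., n-1}, one bit per potential element."""

    def __init__(self, n):
        self.n = n
        self.bits = bytearray((n + 7) // 8)  # all bits start at 0

    def _locate(self, i):
        # Byte that holds bit i, and a mask selecting it within that byte.
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i >> 3, 1 << (i & 7)

    def insert(self, i):
        byte, mask = self._locate(i)
        self.bits[byte] |= mask

    def delete(self, i):
        byte, mask = self._locate(i)
        self.bits[byte] &= ~mask & 0xFF

    def member(self, i):
        byte, mask = self._locate(i)
        return bool(self.bits[byte] & mask)
```

Each operation touches exactly one byte, so its cost does not depend on how many elements the set currently holds.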
Bit Vectors
A subset of a set of size N always takes N
bits to represent, independent of size of
subset
Makes sense if:
N is not too large
need to represent sets of size comparable to N
Storage Efficiency
Bit Vector vs. Binary Trees
Binary Tree, set of size n
Requires n(2p + K) bits
K >= lg N, size of field to represent key value
p = number of bits in a pointer
Bit Vector, takes N bits
If n ≈ N, then the bit vector is more efficient
If p = K = 32, then the tree becomes more space
efficient when n/N falls below about 1%
More precisely, the break-even point is n(2p + K) = N, i.e. n/N = 1/96
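The break-even arithmetic can be checked with a few lines. The function names are illustrative; the formulas are the ones given above.

```python
from fractions import Fraction

def tree_bits(n, p=32, K=32):
    """Bits for a binary tree holding n keys: two pointers plus a key per node."""
    return n * (2 * p + K)

def bit_vector_bits(N):
    """Bits for a bit vector over a universe of size N."""
    return N

def break_even_fraction(p=32, K=32):
    """Value of n/N at which both representations use the same space."""
    return Fraction(1, 2 * p + K)
```

For example, with N = 96,000 and p = K = 32, a tree of 1,000 keys uses exactly 96,000 bits, the same as the bit vector; any smaller set favors the tree.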
When to use Bit Vectors?
When universe is relatively small
When sets are large in relation to size of
universe
Advantages of Bit Vectors
O(1) implementation of Insert, Delete,
Member
Union and Intersection easy
Implement via Boolean OR (union) and AND (intersection) operations
May actually take less than one op per element, as
operations are performed on full machine words
If the machine word is 32 bits, then one machine operation
handles 32 potential elements of the set
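A sketch of word-at-a-time union and intersection, assuming each set is stored as a list of 32-bit words (Python integers stand in for machine words here; the helpers `to_words` and `to_set` are only for building and inspecting examples):

```python
WORD = 32  # assumed machine word size in bits

def to_words(elements, n):
    """Pack a set of indices from {0, ..., n-1} into a list of WORD-bit words."""
    words = [0] * ((n + WORD - 1) // WORD)
    for i in elements:
        words[i // WORD] |= 1 << (i % WORD)
    return words

def to_set(words):
    """Unpack a list of words back into a Python set of indices."""
    return {w * WORD + b
            for w, word in enumerate(words)
            for b in range(WORD)
            if (word >> b) & 1}

def union(a, b):
    # One OR per word covers WORD potential elements.
    return [x | y for x, y in zip(a, b)]

def intersection(a, b):
    # One AND per word covers WORD potential elements.
    return [x & y for x, y in zip(a, b)]
```

For a universe of size 64, each of these operations performs only two word operations.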
Disadvantages of Bit Vectors
On some computers access to individual bits can
require shifting and masking operations
(expensive)
Result is that Member may be much more
expensive than Union
Initialization takes Θ(N) time: zero all the bits in the
vector
A constant-time initialization algorithm can be used instead,
but it raises the storage requirement to 2p + 1 bits per
element
So, in practice, just use machine ops to set the vector to zero,
which are efficient
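One well-known scheme that matches the 2p + 1 bits per element figure keeps two extra index arrays alongside the bits. Whether this is exactly the algorithm the slides have in mind is an assumption. Python cannot allocate uninitialized memory, so the sketch below fills the arrays with random garbage to stand in for it; the names `when`, `which`, and `top` are illustrative.

```python
import random

class LazyBitVector:
    """Bit vector whose logical 'clear everything' costs O(1)."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        # Garbage values simulate uninitialized memory (assumption for the demo).
        self.bit = [rng.randrange(2) for _ in range(n)]
        self.when = [rng.randrange(n) for _ in range(n)]   # position in 'which'
        self.which = [rng.randrange(n) for _ in range(n)]  # stack of touched indices
        self.top = 0  # number of entries of 'which' that are meaningful

    def _valid(self, i):
        # bit[i] is trusted only if the two arrays point at each other.
        w = self.when[i]
        return 0 <= w < self.top and self.which[w] == i

    def _set(self, i, value):
        if not self._valid(i):
            self.when[i] = self.top
            self.which[self.top] = i
            self.top += 1
        self.bit[i] = value

    def insert(self, i):
        self._set(i, 1)

    def delete(self, i):
        self._set(i, 0)

    def member(self, i):
        return self._valid(i) and self.bit[i] == 1

    def clear(self):
        self.top = 0  # constant time: every entry becomes untrusted again
```

The two index arrays account for the 2p bits per element and the data bit for the remaining one, which is why zeroing with ordinary machine operations is usually the better trade.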
Tries and Digital Search Trees
If the key can be decomposed into
characters, then the characters of the key
can be used as indices
Tries are based on this idea
“trie” comes from the middle letters of “retrieval”; it is a pun
on “tree”, but pronounced “try”
Tries
Assume k possible character values
A trie is a (k+1)-ary tree
each node a table of k+1 pointers
One pointer for each possible character
One for the end-of-string character
Trie Example
Tries
Path for a key of m characters has length m,
followed by the end-of-string pointer
Don’t need to store the key itself: it is the
path followed.
An info field might be pointed to by the end-of-string element
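A minimal trie sketch under these definitions, assuming the alphabet is A..Z (so k = 26) and using slot k of each node for the end-of-string pointer. The names `TrieNode`, `insert`, and `search` are illustrative.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed alphabet, k = 26
K = len(ALPHABET)
END = K  # slot k holds the end-of-string pointer

class TrieNode:
    def __init__(self):
        self.child = [None] * (K + 1)  # a table of k+1 pointers

def insert(root, key, info=True):
    node = root
    for ch in key:
        i = ALPHABET.index(ch)
        if node.child[i] is None:
            node.child[i] = TrieNode()
        node = node.child[i]
    node.child[END] = info  # the key itself is never stored, only its info

def search(root, key):
    """Return the info stored for key, or None if key is absent."""
    node = root
    for ch in key:
        node = node.child[ALPHABET.index(ch)]
        if node is None:
            return None
    return node.child[END]
```

A lookup follows one pointer per character, so its cost depends only on the key length.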
Tries: Analysis
Let:
n be the number of keys stored in a trie
l be the length(in characters) of the longest key
s be the number of nodes in the trie
k be the size of the alphabet
Pro:
Access time is O(l), independent of k, n and s
Con:
Size -- requires (k+1) * s * p bits
Most pointers are null, so lots of wasted space
Strategies for reducing storage requirements of tries
1. Implement a k-ary trie with m nodes as a 2-D, m by k table
[Table: one row per node (0, 1, 2, 3, 4, 5, …) and one column per character (A, B, C, D, E, …, M, …, P, …, T); each entry is a child node number, or - for no child]
Table approach
Number the nodes in the diagram of slide 13 from
1 to m
The table entry corresponding to the jth child of the ith
node is the index of the child node
How does that save space? There are just as many nodes
and entries as on slide 13
… but each entry needs only ceil(lg(m)) bits, which is smaller than a
pointer
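A sketch of the table representation, with several assumptions not taken from the slides: the alphabet is A..Z, the root is row 0, entry 0 means "no child" (the root is never anyone's child), and a separate one-bit-per-node array `is_key` marks where keys end.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed alphabet
K = len(ALPHABET)
NO_CHILD = 0  # row 0 is the root, so 0 can double as "no child"

class TableTrie:
    def __init__(self):
        self.rows = [[NO_CHILD] * K]  # m-by-k table, one row per node
        self.is_key = [False]         # assumed end-of-key flag per node

    def insert(self, key):
        node = 0
        for ch in key:
            col = ALPHABET.index(ch)
            if self.rows[node][col] == NO_CHILD:
                self.rows.append([NO_CHILD] * K)
                self.is_key.append(False)
                self.rows[node][col] = len(self.rows) - 1
            node = self.rows[node][col]
        self.is_key[node] = True

    def member(self, key):
        node = 0
        for ch in key:
            node = self.rows[node][ALPHABET.index(ch)]
            if node == NO_CHILD:
                return False
        return self.is_key[node]

    def bits_per_entry(self):
        # ceil(lg m) bits suffice to name any of the m rows.
        return max(1, (len(self.rows) - 1).bit_length())
```

With a few thousand nodes, each entry fits in about 12 bits instead of a 32-bit or 64-bit pointer.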
Patricia Tree: Another strategy for reducing space in a trie
Patricia = Practical Algorithm to Retrieve Information Coded in
Alphanumeric
Eliminate nodes with only one nonempty child
Can now skip right from T to the end-of-string marker in TURING in our example
Skip from MA … to E or the end-of-string marker in the MENDEL, MENDELEEV
chain
But need to store with each node the index of the
character on which it discriminates
And need to store the key itself at the leaf
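The idea can be sketched by building the compressed tree directly from a list of keys. This is a simplified illustration, not a full Patricia implementation: it assumes distinct keys, uses `$` as a hypothetical end-of-string marker that appears in no key, and stores children in a dictionary rather than a pointer table.

```python
END = "$"  # hypothetical end-of-string marker, assumed absent from all keys

class Leaf:
    def __init__(self, key):
        self.key = key  # the full key must be kept at the leaf

class Branch:
    def __init__(self, index, children):
        self.index = index        # character position this node discriminates on
        self.children = children  # maps a character to a subtree

def build(keys, depth=0):
    """Build from distinct keys that each end with END and agree before depth."""
    if len(keys) == 1:
        return Leaf(keys[0])
    i = depth
    while len({k[i] for k in keys}) == 1:  # skip positions where all keys agree
        i += 1
    groups = {}
    for k in keys:
        groups.setdefault(k[i], []).append(k)
    return Branch(i, {c: build(g, i + 1) for c, g in groups.items()})

def search(root, key):
    key += END
    node = root
    while isinstance(node, Branch):
        if node.index >= len(key):
            return False
        node = node.children.get(key[node.index])
        if node is None:
            return False
    return node.key == key  # skipped characters are checked only here
```

Because characters are skipped on the way down, the final comparison against the stored key is what makes the answer correct.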
Patricia tree
de la Briandais trees
Another strategy to save space vs. standard tries
Use a linked list instead of a table at the node
level
Each pointer is labeled with the character it indexes
Longer search time than tries; depends on the size of the
character set
Saves significant amounts of memory
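A sketch of the linked-list layout, assuming `$` as a hypothetical end-of-string marker and a dummy root node whose `child` field heads the first-level list (all names are illustrative):

```python
END = "$"  # hypothetical end-of-string marker, assumed absent from all keys

class Node:
    def __init__(self, ch):
        self.ch = ch         # character labeling the pointer to this node
        self.sibling = None  # next alternative character at the same level
        self.child = None    # head of the list for the next character

def insert(root, key):
    parent = root
    for ch in key + END:
        prev, node = None, parent.child
        while node is not None and node.ch != ch:  # linear scan of the list
            prev, node = node, node.sibling
        if node is None:
            node = Node(ch)
            if prev is None:
                parent.child = node
            else:
                prev.sibling = node
        parent = node

def member(root, key):
    parent = root
    for ch in key + END:
        node = parent.child
        while node is not None and node.ch != ch:
            node = node.sibling
        if node is None:
            return False
        parent = node
    return True
```

Only characters that actually occur get a node, at the price of scanning up to k siblings at each level.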
de la Briandais
Another strategy …
Use tries at the first few levels
Use ordinary BSTs or de la Briandais at the
lower levels
reasoning:
speed advantage at the top, but not too much
extra memory required
save space at lower levels
Digital Search Trees
Treat keys as bit strings
(strings over the alphabet {0,1})
Binary tree – search directed left on 0, right on 1
Each node contains two pointers and also
a key whose leading bits match the path to that node
Compare for equality before searching left or right
If frequencies are known, store higher frequency
keys nearer root
Can be grown dynamically
Expected Search time: O(log n)
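A minimal digital search tree sketch, assuming keys are distinct non-negative integers that fit in a fixed width of 8 bits and are examined from the most significant bit down (the width and all names are assumptions for illustration):

```python
BITS = 8  # assumed key width; keys must lie in 0 .. 2**BITS - 1

class DSTNode:
    def __init__(self, key):
        self.key = key
        self.child = [None, None]  # child[0] = left (bit 0), child[1] = right (bit 1)

def bit(key, depth):
    """Bit of key examined at the given depth, most significant bit first."""
    return (key >> (BITS - 1 - depth)) & 1

def insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return DSTNode(key)
    node, depth = root, 0
    while node.key != key:  # compare for equality before branching
        b = bit(key, depth)
        if node.child[b] is None:
            node.child[b] = DSTNode(key)
            return root
        node = node.child[b]
        depth += 1
    return root

def search(root, key):
    node, depth = root, 0
    while node is not None:
        if node.key == key:
            return True
        node = node.child[bit(key, depth)]
        depth += 1
    return False
```

Keys inserted earlier end up nearer the root, so inserting in order of decreasing frequency places the most frequent keys where they are found fastest.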
Digital Search Tree