Arrays and Strings - University of Georgia


Arrays and Strings
CSCI 2720
University of Georgia
Spring 2007
The Array ADT


Stores a sequence of consecutively numbered objects
Each object can be accessed (selected) using its index
More formally ...

Given integers l and u with u >= l - 1, the interval l..u is defined to be the set of integers i such that l <= i <= u
An array is a function from an interval (the index set of the array) to a set of objects or elements (the value set of the array)
Formally, continued …

If X is an array and i is a member of its index set, we write X[i] to denote the value of X at i
The members of the range of X are known as the elements of X
The Array ADT





Access(X,i)
Length(X)
Assign(X,i,v)
Initialize(X,v)
Iterate(X,F)
Access(X,i)
    Return X[i]
Length(X)
    Return u - l + 1, the number of elements in the index set l..u of X
Assign(X,i,v)
    Replace array X with a function whose value on i is v (and whose value on all other arguments is unchanged)
    We also write this as: X[i] <- v
Initialize(X,v)
    Assign v to every element of array X
Iterate(X,F)
    Apply F to each element of array X in order, from smallest index to largest index
    F is an action on a single array element:
        for i = l to u do
            F(X[i])
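
To make these operations concrete, here is a minimal C sketch of the Array ADT (not from the slides): the struct layout and the int value set are assumptions made for illustration.

    #include <stdio.h>

    /* Array with index set l..u over int values */
    typedef struct {
        int l, u;    /* the interval l..u */
        int *data;   /* contiguous storage for the elements */
    } Array;

    int  Access(Array *X, int i)         { return X->data[i - X->l]; }
    int  Length(Array *X)                { return X->u - X->l + 1; }
    void Assign(Array *X, int i, int v)  { X->data[i - X->l] = v; }

    void Initialize(Array *X, int v) {
        for (int i = X->l; i <= X->u; i++) Assign(X, i, v);
    }

    void Iterate(Array *X, void (*F)(int)) {
        for (int i = X->l; i <= X->u; i++) F(Access(X, i));  /* smallest to largest index */
    }

    void print_elem(int v) { printf("%d ", v); }

    int main(void) {
        int cells[5];
        Array X = { 1, 5, cells };   /* index set 1..5 */
        Initialize(&X, 0);
        Assign(&X, 3, 42);
        Iterate(&X, print_elem);     /* prints: 0 0 42 0 0 */
        printf("\nLength = %d\n", Length(&X));
        return 0;
    }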
String


A special type of array
If Σ is any finite set, then a string over Σ is
    an array whose value set is Σ and whose index set is 0..n-1 for some non-negative n
The set Σ is called an alphabet
Each element of Σ is called a character
Σ often consists of the Roman alphabet, plus digits, the space, and common punctuation marks
Strings

If w is a string with index set 0..n-1, then
    Length(w) = n
    Also written |w|
If w = TREE, then
    w is a string of length 4
    w[0] = T, w[1] = R
The null string is the string whose domain is the empty interval
    Has no elements
    Written ε
String-specific operations


Substring(w,i,m)
Concat(w1,w2)
Substring(w,i,m)

    w is a string; i, m integers
    Returns the string of length m containing the portion of w that starts at i
    Formally:
        returns a string w' with indices 0..m-1 such that w'[k] = w[i+k] for each k satisfying 0 <= k <= m-1
        only applies if
            0 <= i <= |w|
        and
            0 <= m <= |w| - i
        otherwise, returns ε
Substring …

Example: w = SNICKERING
    Substring(w,2,3) returns ICK
    Substring(w,3,0) returns ε
    Substring(w,10,3) returns ε
Prefix
    each Substring(w,0,j) for 0 <= j <= |w| is a prefix of w
Suffix
    each Substring(w,j,|w|-j) for 0 <= j <= |w| is a suffix of w
Concat(w1,w2)

returns a string
    of length |w1| + |w2|
    whose characters are the characters of w1 followed by those of w2
Concat(w,ε) = Concat(ε,w) = w
Example:
    w1 = BIRD, w2 = DOG
    Concat(w1,w2) = BIRDDOG
    Concat(w2,w1) = DOGBIRD
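
A minimal C sketch of Substring and Concat over ordinary '\0'-terminated strings (an assumption; the slides' strings are abstract arrays). Out-of-range arguments yield the null string, matching the definition above.

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    char *Substring(const char *w, int i, int m) {
        int len = (int)strlen(w);
        if (i < 0 || i > len || m < 0 || m > len - i)
            m = 0;                        /* otherwise: return the null string */
        char *s = malloc(m + 1);
        memcpy(s, w + i, m);              /* s[k] = w[i+k] for 0 <= k <= m-1 */
        s[m] = '\0';
        return s;
    }

    char *Concat(const char *w1, const char *w2) {
        char *s = malloc(strlen(w1) + strlen(w2) + 1);
        strcpy(s, w1);                    /* characters of w1 ... */
        strcat(s, w2);                    /* ... followed by those of w2 */
        return s;
    }

    int main(void) {
        printf("%s\n", Substring("SNICKERING", 2, 3));  /* ICK */
        printf("%s\n", Concat("BIRD", "DOG"));          /* BIRDDOG */
        return 0;
    }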
Tables vs. Arrays



Table = physical organization of memory into sequential cells
Array = an abstract data type, with specific operations
Arrays are frequently implemented using tables, but may be implemented in other ways
Multi-dimensional arrays


a function whose range is any set V and whose domain is the Cartesian product of any number of intervals
the Cartesian product of intervals I1, I2, ..., Id, written I1 x I2 x ... x Id, is the set of all d-tuples <i1, i2, ..., id> such that ik ∈ Ik for each k
Multi-D arrays



if C is a multidimensional array and i = <i1, i2, ..., id>, then C[i1, i2, ..., id] is the value of C at i
The dimension of a multi-D array is the number of intervals whose Cartesian product makes up the index set
The size of the kth dimension of such an array is the number of elements in Ik
Contiguous Representation of Arrays:
Why Computer Scientists start counting at 0

Store elements in a table:

    address:  x     x+4   x+8   x+12  x+16  x+20
    value:    17    43    87    94    101   143
    element:  x[0]  x[1]  x[2]  x[3]  x[4]  x[5]

Each element x[i] begins at x + 4i
    x = starting address of the array
    4 = sizeof(element)
    i = index of element of interest
(with indices starting at 0, no offset correction is needed in the formula ... which is why computer scientists start counting at 0)
More generally

if X is the address of the first cell in memory of an array with indices l..u, and if each element has size L, then
    the element with index i is stored at address X + L * (i - l)
    the element can be retrieved in constant time
When iterating through the array

can save a few operations by doing "pointer arithmetic"
    just add L to the current address to get the next element
    don't have to subtract, multiply, add
    still linear in the number of elements, but a faster linear
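
As a quick illustration (a sketch, not from the slides), both loops below are linear, but the second replaces the per-element multiply with a single pointer increment:

    #include <stdio.h>

    int main(void) {
        int a[6] = { 17, 43, 87, 94, 101, 143 };
        int sum1 = 0, sum2 = 0;

        for (int i = 0; i < 6; i++)        /* index arithmetic: recompute a + 4*i each time */
            sum1 += a[i];

        for (int *p = a; p < a + 6; p++)   /* pointer arithmetic: just add sizeof(int) */
            sum2 += *p;

        printf("%d %d\n", sum1, sum2);     /* 485 485 */
        return 0;
    }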
Where’s the needed info stored?


Could store L, l, and u at the starting address of X ... but would need to adjust the formula to calculate the location of individual cells.
If the language is strongly typed, some or all of L, l, and u may be part of the definition of X and stored elsewhere
    C/C++: L is part of the typing info, l is assumed to be 0, u is not stored (the programmer needs to keep track)
Where’s the needed info stored?

Can use a sentinel value after the last element of the array
    C/C++: we do this with strings, storing a '\0' at the end
    means that you need to iterate through to find Length, which is no longer O(1)
What if the elements have different
lengths?

allot Max to all elements
    wasted space
    can still access in O(1) time
store pointers to elements
    pointers require memory
    need 2 accesses (calculate location of pointer, then follow it), but still O(1)
    pointer to element i is at X + P * (i - l), where P = sizeof(pointer)
    easy to swap even large or complex elements ... just swap their pointers
2D arrays


can also represent in contiguous memory ... but do we keep rows together or do we keep columns together?

Example: array with logical ordering

    A  B  C  D
    E  F  G  H
    I  J  K  L

Row-major:        A B C D E F G H I J K L
v. column-major:  A E I B F J C G K D H L
Where are 2D elements stored?

Row-major: R[i,j] stored at:
    R + L * (NPR * (i-1) + (j-1)), where
        R is the starting address of the array
        L is the size of each element
        NPR is the number of elements per row
        i is the row number
        j is the column number
Where are 2D elements stored?

Column-major: C[i,j] stored at:
    C + L * (NPC * (j-1) + (i-1)), where
        C is the starting address of the array
        L is the size of each element
        NPC is the number of elements per column
        i is the row number
        j is the column number
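
The two formulas can be checked with a one-character-per-element (L = 1) sketch; the 3-row, 4-column A..L array is the example above.

    #include <stdio.h>

    #define ROWS 3   /* NPC: elements per column */
    #define COLS 4   /* NPR: elements per row */

    int main(void) {
        char row_major[] = "ABCDEFGHIJKL";  /* rows kept together */
        char col_major[] = "AEIBFJCGKDHL";  /* columns kept together */

        int i = 2, j = 3;                   /* row 2, column 3 is logically 'G' */

        printf("%c\n", row_major[COLS * (i - 1) + (j - 1)]);  /* NPR(i-1) + (j-1): G */
        printf("%c\n", col_major[ROWS * (j - 1) + (i - 1)]);  /* NPC(j-1) + (i-1): G */
        return 0;
    }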

Constant-time initialization

procedure Initialize(ptr M, value v)
    // Initialize each element of M to v
    Count(M) <- 0
    Default(M) <- v

function Valid(int i, ptr M): boolean
    // return true if M[i] has been modified since the last Initialize
    return (0 <= When(M)[i] < Count(M)) and
           (Which(M)[When(M)[i]] == i)

function Access(int i, ptr M): value
    // return M[i]
    if Valid(i, M) then
        return Data(M)[i]
    else
        return Default(M)

procedure Assign(ptr M, int i, value v)
    // Set M[i] <- v
    if not Valid(i, M) then
        When(M)[i] <- Count(M)
        Which(M)[Count(M)] <- i
        Count(M) <- Count(M) + 1
    Data(M)[i] <- v

But requires 3x memory ... three parallel tables:
    Which(M)
    When(M)
    Data(M)
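
A runnable C sketch of the same scheme (names mirror the pseudocode; the int value type and malloc'd tables are assumptions). The three tables are deliberately left uninitialized: the Valid test screens out whatever garbage they hold.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        int count, def;
        int *which, *when, *data;   /* the three parallel tables: 3x memory */
    } Mem;

    Mem *Create(int n) {
        Mem *M = malloc(sizeof *M);
        M->which = malloc(n * sizeof(int));   /* left uninitialized on purpose */
        M->when  = malloc(n * sizeof(int));
        M->data  = malloc(n * sizeof(int));
        return M;
    }

    void Initialize(Mem *M, int v) {   /* O(1): touches no table cells */
        M->count = 0;
        M->def = v;
    }

    int Valid(Mem *M, int i) {         /* modified since the last Initialize? */
        return 0 <= M->when[i] && M->when[i] < M->count
            && M->which[M->when[i]] == i;
    }

    int Access(Mem *M, int i) {
        return Valid(M, i) ? M->data[i] : M->def;
    }

    void Assign(Mem *M, int i, int v) {
        if (!Valid(M, i)) {
            M->when[i] = M->count;
            M->which[M->count] = i;
            M->count++;
        }
        M->data[i] = v;
    }

    int main(void) {
        Mem *M = Create(1000);
        Initialize(M, -1);      /* all 1000 cells now read as -1 ... */
        Assign(M, 42, 7);       /* ... except those assigned afterwards */
        printf("%d %d\n", Access(M, 42), Access(M, 999));  /* 7 -1 */
        return 0;
    }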
Sparse Arrays




Definitions
List Representations
Hierarchical Tables
Arrays with Special Shapes
Sparse Arrays




some arrays contain only a few elements ... wouldn't it be more efficient to store only the non-null values? same idea when only a few values differ from the majority
some arrays have a special shape ... upper-triangular matrix, symmetric matrix
sparse array: an array in which only a small fraction of the elements are significant in some way
null element: doesn't need to be stored; is either actually null, or well-known, or easily calculated
List representations
Hierarchical tables
Upper-triangular matrix
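
As a concrete instance of a specially shaped array (a sketch, not from the slides): an n x n upper-triangular matrix needs only n(n+1)/2 cells, since every element below the diagonal is the null element and can be computed rather than stored.

    #include <stdio.h>

    #define N 4

    int store[N * (N + 1) / 2];          /* 10 cells instead of 16 */

    /* row i keeps only columns i..N-1; rows 0..i-1 contribute
       N + (N-1) + ... + (N-i+1) = i*N - i*(i-1)/2 cells before it */
    int offset(int i, int j) {           /* assumes 0 <= i <= j < N */
        return i * N - i * (i - 1) / 2 + (j - i);
    }

    int  get(int i, int j)         { return (i > j) ? 0 : store[offset(i, j)]; }
    void set(int i, int j, int v)  { if (i <= j) store[offset(i, j)] = v; }

    int main(void) {
        set(1, 3, 5);
        printf("%d %d\n", get(1, 3), get(3, 1));   /* 5 0 */
        return 0;
    }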
Representation of Strings



Background
Huffman Encoding
Lempel-Ziv Encoding
Representing Strings




How much space do we need?
Assume we represent every character.
How many bits to represent each character?
    Depends on |Σ|
Bits to encode a character

Two character alphabet {A,B}
    one bit per character:
        0 = A, 1 = B
Four character alphabet {A,B,C,D}
    two bits per character:
        00 = A, 01 = B, 10 = C, 11 = D
Six character alphabet {A,B,C,D,E,F}
    three bits per character:
        000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused
More generally




The bit sequence representing a character is called the encoding of the character.
There are 2^n different bit sequences of length n,
so ceil(lg|Σ|) bits are required to represent each character in Σ
    e.g., a 26-character alphabet needs ceil(lg 26) = 5 bits per character
if we use the same number of bits for each character, then the length of the encoding of a word is |w| * ceil(lg|Σ|)
Can we do better??

If  is very small, might use runlength encoding
What if …

the string we encode doesn't use all the letters in the alphabet?
    then ceil(lg(|set_of_characters_used|)) bits suffice
But
    then we also need to store / transmit the mapping from encodings to characters
    ... and the set of characters used is typically close to the size of the alphabet
Huffman Encoding:


Still assumes encoding on a per-character basis
Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings
    requires assigning longer codes to rarely used characters
Problem:
    when decoding, need to know how many bits to read off for each character
Solution:
    Choose an encoding that ensures that no character encoding is a prefix of any other character encoding. An encoding tree has this property.
A Huffman Encoding Tree

[Figure: a Huffman tree of total weight 21. The root's 1-branch goes directly to the leaf E (weight 9); its 0-branch goes to an internal node of weight 12, whose 0-child (weight 5) has leaves A (3) and T (2), and whose 1-child (weight 7) has leaves R (3) and N (4).]

The resulting codes:
    A  000
    T  001
    R  010
    N  011
    E  1
Weighted path length

Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)
             = (3 * 3) + (3 * 2) + (3 * 3) + (3 * 4) + (1 * 9)
             = 9 + 6 + 9 + 12 + 9 = 45

(using the codes A = 000, T = 001, R = 010, N = 011, E = 1 and frequencies f(A) = 3, f(T) = 2, f(R) = 3, f(N) = 4, f(E) = 9)

Claim (proof in text): no other encoding can result in a shorter weighted path length
Building the Huffman Tree

[Figures: step-by-step construction for frequencies A = 3, T = 4, R = 4, E = 5. First the two lowest-frequency nodes, A (3) and T (4), are merged under a parent of weight 7; next R (4) and E (5) are merged under a parent of weight 9; finally the nodes of weight 7 and 9 are merged under the root of weight 16. Labeling branches 0 and 1 gives the codes A = 00, T = 01, R = 10, E = 11.]
Taking a step back …

Why do we need compression?
    rate of creation of image and video data
    image data from a digital camera
        today 1k by 1.5k is common = 1.5 Mbytes
        need 2k by 3k to equal a 35mm slide = 6 Mbytes
    video at even the low resolution of 512 by 512 and 3 bytes per pixel, 30 frames/second ...
Compression basics

video data rate
    23.6 Mbytes/second
    2 hours of video = 169 gigabytes
mpeg-1 compresses
    23.6 Mbytes/second down to 187 Kbytes per second
    169 gigabytes down to 1.3 gigabytes
compression is essential for both storage and transmission of data
Compression basics

compression is very widely used
    jpeg, gif for single images
    mpeg 1, 2, 3, 4 for video sequences
    zip for computer data
    mp3 for sound
based on two fundamental principles
    spatial coherence: similarity with spatial neighbors
    temporal coherence: similarity with temporal neighbors
Basics of compression






character = basic data unit in the input stream; represents a byte, bit, etc.
strings = sequences of characters
encoding = compression
decoding = decompression
codeword = data element used to represent an input character or character string
codetable = list of codewords
Codeword

the encoder/compressor takes
    characters/strings as input and uses the codetable to decide which codewords to produce
the decoder/decompressor takes
    codewords as input and uses the same codetable to decide which characters/strings to produce
Codetable




clearly the encoder must pass the encoded data to the decoder as a series of codewords
they must also share the codetable
the codetable can be passed explicitly or implicitly
that is, we either
    pass it across
    agree on it beforehand (hard-wired)
    recreate it from the codewords (clever!)
Basic definitions

compression ratio =
    size of original data / size of compressed data
    basically, the higher the compression ratio the better
lossless compression
    output data is exactly the same as input data
    essential for encoding computer-processed data
lossy compression
    output data is not the same as input data
    acceptable for data that is only viewed or heard
Lossless versus lossy




the human visual system is less sensitive to high-frequency losses and to losses in color
lossy compression is acceptable for visual data
the degree of loss is usually a parameter of the compression algorithm
tradeoff: loss versus compression
    higher compression => more loss
    lower compression => less loss
Symmetric versus asymmetric

symmetric
    encoding time == decoding time
    essential for real-time applications (i.e. video or audio on demand)
asymmetric
    encoding time >> decoding time
    ok for write-once, read-many situations
Entropy encoding



compression that does not take into account what is being compressed
normally is also lossless encoding
most common types of entropy encoding
    run-length encoding
    Huffman encoding
    modified Huffman (fax ...)
    Lempel-Ziv
Source encoding



takes into account the type of data (i.e. visual)
normally is lossy but can also be lossless
most common types in use:
    JPEG, GIF = single images
    MPEG = sequence of images (video)
    MP3 = sound sequence
often uses entropy encoding as a subroutine
Run length encoding



one of the simplest and earliest types of compression
takes account of repeating data (called runs)
runs are represented by a count along with the original data
    e.g. AAAABB => 4A2B
do you run-length encode a single character?
    no, use a special prefix character to represent the start of runs
Run length encoding

runs are represented as
    <prefix char><repeat count><run char>
the prefix char itself becomes
    <prefix char>1<prefix char>
want a prefix char that is not too common
an example early use is the MacPaint file format
run-length encoding is lossless and has fixed-length codewords
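
A small C sketch of this prefix-character scheme (the '!' prefix, the 3-character threshold before a run is worth encoding, and the single-digit count cap are all assumptions to keep the example short):

    #include <stdio.h>

    #define PREFIX '!'

    void rle_encode(const char *in, char *out) {
        while (*in) {
            int n = 1;
            while (in[n] == in[0] && n < 9) n++;     /* measure the run */
            if (n >= 3) {                            /* encode as <prefix><count><char> */
                out += sprintf(out, "%c%d%c", PREFIX, n, in[0]);
                in += n;
            } else if (in[0] == PREFIX) {            /* escape a literal prefix char */
                out += sprintf(out, "%c1%c", PREFIX, PREFIX);
                in += 1;
            } else {
                *out++ = *in++;                      /* copy short runs verbatim */
            }
        }
        *out = '\0';
    }

    int main(void) {
        char out[64];
        rle_encode("AAAABB", out);
        printf("%s\n", out);    /* !4ABB */
        return 0;
    }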
MacPaint File Format
Run length encoding





works best for images with a solid background
a good example of such an image is a cartoon
does not work as well for natural images
does not work well for English text
however, it is almost always a part of a larger compression system
Huffman encoding




assume we know the frequency of each character in the input stream
then encode each character as a variable-length bit string, with the length inversely proportional to the character frequency
variable-length codewords are used; an early example is Morse code
Huffman produced an algorithm for assigning codewords optimally
Huffman encoding


input = probabilities of occurrence of each
input character (frequencies of occurrence)
output is a binary tree





each leaf node is an input character
each branch is a zero or one bit
codeword for a leaf is the concatenation of bits
for the path from the root to the leaf
codeword is a variable length bit string
a very good compression ratio (optimal)?
Huffman encoding

Basic algorithm
    Mark all characters as free tree nodes
    While there is more than one free node
        Take the two nodes with the lowest freq. of occurrence
        Create a new tree node with these nodes as children and with freq. equal to the sum of their freqs.
        Remove the two children from the free node list.
        Add the new parent to the free node list
    The last remaining free node is the root of the binary tree used for encoding/decoding
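
A compact C sketch of this basic algorithm, using a linear scan of the free-node list where a production encoder would use a priority queue; the A/T/R/E frequencies are the example from the tree-building slides.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Node {
        int freq;
        char ch;                      /* leaf character (0 for internal nodes) */
        struct Node *left, *right;
    } Node;

    Node *new_node(int freq, char ch, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->freq = freq; n->ch = ch; n->left = l; n->right = r;
        return n;
    }

    Node *take_min(Node **list, int *n) {   /* remove the lowest-freq free node */
        int best = 0;
        for (int i = 1; i < *n; i++)
            if (list[i]->freq < list[best]->freq) best = i;
        Node *min = list[best];
        list[best] = list[--*n];
        return min;
    }

    void print_codes(Node *t, char *code, int depth) {
        if (!t->left) {                     /* leaf: path from root is its code */
            code[depth] = '\0';
            printf("%c %s\n", t->ch, code);
            return;
        }
        code[depth] = '0'; print_codes(t->left,  code, depth + 1);
        code[depth] = '1'; print_codes(t->right, code, depth + 1);
    }

    int main(void) {
        Node *free_list[8]; int n = 0;
        free_list[n++] = new_node(3, 'A', NULL, NULL);
        free_list[n++] = new_node(4, 'T', NULL, NULL);
        free_list[n++] = new_node(4, 'R', NULL, NULL);
        free_list[n++] = new_node(5, 'E', NULL, NULL);

        while (n > 1) {                     /* merge the two lowest-freq nodes */
            Node *a = take_min(free_list, &n);
            Node *b = take_min(free_list, &n);
            free_list[n++] = new_node(a->freq + b->freq, 0, a, b);
        }
        char code[16];
        print_codes(free_list[0], code, 0); /* A 00, T 01, R 10, E 11 */
        return 0;
    }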
Huffman example



a series of colors in an 8 by 8 screen
colors are red, green, cyan, blue, magenta, yellow, and black
sequence is
    rkkkkkkk
    kkkrrkkk
    kkrrrrgg
    kkbcccrr
    gggmcbrr
    bbbmybbr
    gggggggr
    grrrrgrr
Huffman example

[Figures: four slides building the Huffman tree for these color frequencies, step by step.]
Fixed versus variable length
codewords






run-length codewords are fixed length
Huffman codewords are variable length
    length inversely proportional to frequency
all variable-length compression schemes have the prefix property
    one code cannot be the prefix of another
the binary tree structure guarantees that this is the case (a leaf node is a leaf node!)
Huffman encoding

advantages
    maximum compression ratio, assuming correct probabilities of occurrence
    easy to implement and fast
disadvantages
    need two passes for both encoder and decoder
        one to create the frequency distribution
        one to encode/decode the data
    can avoid this by sending the tree (takes time) or by having unchanging frequencies
Modified Huffman encoding



if we know the frequency of occurrences, then Huffman works very well
consider the case of a fax: mostly long white spaces with short bursts of black
do the following
    run-length encode each string of bits on a line
    Huffman encode these run-length codewords
    use a predefined frequency distribution
a combination: run length first, then Huffman
Beyond Huffman Coding …


1977 – Lempel & Ziv, Israeli
information theorists, develop a
dictionary-based compression
method (LZ77)
1978 – they develop another
dictionary-based compression
method (LZ78)
The LZ family

LZ77
    LZR
    LZSS
    LZB
    LZH – used by zip and unzip
LZ78
    LZW – Unix compress
    LZC – Unix compress
    LZT
    LZMW
    LZJ
    LZFG
Overview of LZ family

To demonstrate:
    use a simple alphabet containing only two letters, a and b,
    and create a sample stream of text
LZ family overview

Rule: Separate this stream of characters into pieces of text so that the shortest piece of data is the string of characters that we have not seen so far.
Sender : The Compressor

Before compression, the pieces of text from the breaking-down process are indexed from 1 to n:
    indices are used to number the pieces of data.
    The empty string (start of text) has index 0.
    The piece indexed by 1 is a. Thus a, together with the initial string, must be numbered 0a.
    String 2, aa, will be numbered 1a, because it contains a, whose index is 1, plus the new character a.
the process of renaming pieces of text starts to pay off:
    Small integers replace what were once long strings of characters.
    we can now throw away our old stream of text and send the encoded information to the receiver
Bit Representation of Coded Information

Now we want to calculate the number of bits needed
    each chunk is an int and a letter
    the number of bits for the int depends on the size of table permitted in the dictionary
    every character will occupy 8 bits because it will be represented in US-ASCII format
Compression good?


in a long string of text, the number
of bits needed to transmit the coded
information is small compared to
the actual length of the text.
example: 12 bits to transmit the
code 2b instead of 24 bits (8 + 8 +
8) needed for the actual text aab.
Receiver: The Decompressor (Implementation)

the receiver knows exactly where the boundaries are, so there is no problem in reconstructing the stream of text.
Preferable to decompress the file in one pass; otherwise, we will encounter a problem with temporary storage.
Lempel-Ziv applet

See http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/#JavaApplet
Lempel Ziv Welsch (LZW)






previous methods worked only on characters
LZW works by encoding strings
some strings are replaced by a single codeword
for now assume the codeword size is fixed (12 bits)
for 8-bit characters, the first 256 (or fewer) entries in the table are reserved for the characters
the rest of the table (codes 256-4095) represents strings
LZW compression






the trick is that the string-to-codeword mapping is created dynamically by the encoder
it is also recreated dynamically by the decoder
need not pass the code table between the two
is a lossless compression algorithm
degree of compression is hard to predict
    depends on data, but gets better as the codeword table contains more strings
LZW encoder
Initialize table with single character strings
STRING = first input character
WHILE not end of input stream
    CHARACTER = next input character
    IF STRING + CHARACTER is in the string table
        STRING = STRING + CHARACTER
    ELSE
        Output the code for STRING
        Add STRING + CHARACTER to the string table
        STRING = CHARACTER
END WHILE
Output the code for STRING
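
A C sketch of this encoder (a linear search over a string table stands in for the hash table or trie a real encoder would use; 12-bit codewords appear as MAX_CODES = 4096):

    #include <stdio.h>
    #include <string.h>

    #define MAX_CODES 4096
    #define MAX_LEN   64

    static char table[MAX_CODES][MAX_LEN];
    static int ncodes;

    int lookup(const char *s) {                /* code for s, or -1 */
        for (int i = 0; i < ncodes; i++)
            if (strcmp(table[i], s) == 0) return i;
        return -1;
    }

    void lzw_encode(const char *input) {
        ncodes = 256;
        for (int i = 0; i < 256; i++) {        /* single-character strings */
            table[i][0] = (char)i; table[i][1] = '\0';
        }
        char string[MAX_LEN] = { input[0], '\0' };
        for (int i = 1; input[i] != '\0'; i++) {
            char candidate[MAX_LEN];
            snprintf(candidate, MAX_LEN, "%s%c", string, input[i]);
            if (lookup(candidate) >= 0) {      /* STRING + CHARACTER is in the table */
                strcpy(string, candidate);
            } else {
                printf("<%d>", lookup(string));        /* output code for STRING */
                if (ncodes < MAX_CODES)                /* add STRING + CHARACTER */
                    strcpy(table[ncodes++], candidate);
                string[0] = input[i]; string[1] = '\0';
            }
        }
        printf("<%d>\n", lookup(string));      /* code for the final STRING */
    }

    int main(void) {
        lzw_encode("BABAABAAA");   /* prints <66><65><256><257><65><260> */
        return 0;
    }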
Demonstrations

Another animated LZ algorithm …

http://www.data-compression.com/lempelziv.html
LZW encoder example

compress the string BABAABAAA
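
Tracing the encoder on BABAABAAA (ASCII B = 66, A = 65):

    STRING  CHARACTER  STRING+CHARACTER in table?   output   add to table
    B       A          no                           <66>     256 = BA
    A       B          no                           <65>     257 = AB
    B       A          yes (BA = 256)
    BA      A          no                           <256>    258 = BAA
    A       B          yes (AB = 257)
    AB      A          no                           <257>    259 = ABA
    A       A          no                           <65>     260 = AA
    A       A          yes (AA = 260)
    AA      (end)                                   <260>

    encoded output: <66><65><256><257><65><260>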
LZW decoder
Lempel-Ziv compression


a lossless compression algorithm
All encodings have the same length


But may represent more than one
character
Uses a “dictionary” approach –
keeps track of characters and
character strings already
encountered
LZW decoder example

decompress the string <66><65><256><257><65><260>
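
A trace (not on the slides) using the standard LZW decoding rule, where each new table entry is the previous output plus the first character of the current output:

    read <66>  -> output B
    read <65>  -> output A;   add 256 = BA
    read <256> -> output BA;  add 257 = AB
    read <257> -> output AB;  add 258 = BAA
    read <65>  -> output A;   add 259 = ABA
    read <260> -> 260 is not in the table yet: in this special case the
                  decoder outputs the previous string plus its own first
                  character, AA, and adds 260 = AA

    decoded output: BABAABAAA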
LZW Issues



compression gets better as the code table grows
what happens when all 4096 locations in the string table are used?
A number of options, but the encoder and decoder must agree to do the same thing
    do not add any more entries to the table (as is)
    clear the codeword table and start again
LZW advantages/disadvantages

advantages
    simple, fast, and good compression
    can do compression in one pass
    a dynamic codeword table is built for each file
    decompression recreates the codeword table so it does not need to be passed
disadvantages
    not the optimum compression ratio
    actual compression hard to predict
Entropy methods




all previous methods are lossless and entropy based
lossless methods are essential for computer data (zip, gnuzip, etc.)
the combination of run-length encoding and Huffman is a standard tool
these are often used as subroutines by other, lossy methods (JPEG, MPEG)
String Searching




Background
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm
Fingerprinting and the Karp-Rabin
algorithm