Efficient encoding methods
Coding theory refers to the study of code properties and their suitability for specific applications. Efficient codes are used, e.g., in data compression, cryptography, error correction, and group testing.
Codes play a central part in information theory, in particular in the design of efficient and reliable data transmission methods. Encoding methods focus on the reduction of redundancy (in data compression) or on the clever use of redundancy (in error detection and correction mechanisms).
26/04/2020 Applied Algorithmics - week6 1
Data compression
Data compression is the process of encoding information using fewer bits or other information-bearing units.
Compression is possible where the input data have statistical redundancy (e.g., in text files) or when relatively minor changes leading to a smaller representation do not affect the quality/fidelity of the input (e.g., in pictures, video, or audio files).
Popular instances of data compression that many computer users are familiar with are the ZIP file format (texts), the JPEG format (pictures) and the MPEG format (audio and video).
Data compression
Some compression schemes are reversible, so that the original data can be reconstructed (lossless data compression), while others accept some loss of data in order to achieve higher compression (lossy data compression).
Compression is important because it helps reduce the consumption of expensive resources, such as disk space or connection bandwidth. However, compression requires increased information processing power, which can also be expensive.
Data compression - simple example
Run-Length Encoding
Data files frequently contain the same character repeated many times in a row. For example, text files use multiple spaces to separate sentences, indent paragraphs, format tables & charts, etc. Digitized signals can also have runs of the same value, indicating that the signal is not changing. For example, an image of the night-time sky would contain long runs of the character or characters representing the black background.
Data compression - simple example
Run-Length Encoding
In this scheme we focus on long runs of characters.
Each time a long run is encountered in the input data, two values are written to the output file. The first of these values is the character itself, i.e., a flag to indicate that run-length compression is beginning. The second value is the number of characters in the run.
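The scheme above can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: every run, including runs of length 1, is written as a (character, run length) pair, whereas the text applies the encoding to long runs only. The function names rle_encode and rle_decode are made up for this sketch.

```python
from itertools import groupby


def rle_encode(data):
    """Return one (symbol, run_length) pair per maximal run in data."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]


def rle_decode(pairs):
    """Rebuild the original string from (symbol, run_length) pairs."""
    return "".join(symbol * count for symbol, count in pairs)


encoded = rle_encode("aaaabccc")
print(encoded)              # [('a', 4), ('b', 1), ('c', 3)]
print(rle_decode(encoded))  # aaaabccc
```

Writing a pair for a run of length 1 doubles its size, which is why practical schemes restrict the encoding to long runs.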
Move to Front Transform
The Move to Front (MTF) transform is an encoding of data (typically a stream of bytes) designed to improve the performance of entropy encoding techniques of compression (an entropy encoding is a coding scheme that assigns codes to symbols so as to match code lengths with the probabilities of the symbols). When properly implemented, it is fast enough that its benefits usually justify including it as an extra step in data compression algorithms.
Move to Front Transform
In the context of MTF each byte value is encoded by its index in a list, which changes over the course of the algorithm. The list is initially stored, e.g., in order by byte value (0, 1, 2, 3, ..., 255). Therefore, the first byte is always encoded by its own value. However, after encoding a byte, that value is moved to the front of the list before continuing to the next byte.
Move to Front Transform - example
Let S=<9,9,8,8,8,1,9,9,9> be an input sequence and let the initial content of the queue Q be [0,1,2,3,4,5,6,7,8,9]. The encoding process will transform S as follows:
S=<9,9,8,8,8,1,9,9,9> and Q=[0,1,2,3,4,5,6,7,8,9]
S=<9,9,8,8,8,1,9,9,9> and Q=[9,0,1,2,3,4,5,6,7,8]
S=<9,0,8,8,8,1,9,9,9> and Q=[9,0,1,2,3,4,5,6,7,8]
S=<9,0,9,8,8,1,9,9,9> and Q=[8,9,0,1,2,3,4,5,6,7]
S=<9,0,9,0,8,1,9,9,9> and Q=[8,9,0,1,2,3,4,5,6,7]
S=<9,0,9,0,0,1,9,9,9> and Q=[8,9,0,1,2,3,4,5,6,7]
S=<9,0,9,0,0,3,9,9,9> and Q=[1,8,9,0,2,3,4,5,6,7]
S=<9,0,9,0,0,3,2,9,9> and Q=[9,1,8,0,2,3,4,5,6,7]
S=<9,0,9,0,0,3,2,0,9> and Q=[9,1,8,0,2,3,4,5,6,7]
S=<9,0,9,0,0,3,2,0,0> and Q=[9,1,8,0,2,3,4,5,6,7]
Each encoded value (shown in blue on the slide) refers to the position of the symbol in the last instance of Q.
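The example can be reproduced with a short Python sketch. The function names mtf_encode and mtf_decode and the alphabet parameter (the initial content of Q) are assumptions made for illustration; a plain list is used for the queue, so each step costs time proportional to the alphabet size.

```python
def mtf_encode(sequence, alphabet):
    """Encode each symbol by its current index in the queue, then move it to the front."""
    queue = list(alphabet)
    output = []
    for symbol in sequence:
        index = queue.index(symbol)        # position of the symbol in the current queue
        output.append(index)
        queue.insert(0, queue.pop(index))  # move the symbol to the front
    return output


def mtf_decode(indices, alphabet):
    """Invert mtf_encode by replaying the same queue updates."""
    queue = list(alphabet)
    output = []
    for index in indices:
        symbol = queue.pop(index)
        output.append(symbol)
        queue.insert(0, symbol)
    return output


S = [9, 9, 8, 8, 8, 1, 9, 9, 9]
encoded = mtf_encode(S, range(10))
print(encoded)                         # [9, 0, 9, 0, 0, 3, 2, 0, 0]
print(mtf_decode(encoded, range(10)))  # [9, 9, 8, 8, 8, 1, 9, 9, 9]
```

Runs of equal symbols in the input become runs of zeros in the output, which is what makes the result easier to compress with an entropy coder.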
Burrows-Wheeler Transform
The Burrows-Wheeler transform (BWT), a.k.a. block-sorting compression, is one of the most popular methods in data compression. It was invented by Michael Burrows and David Wheeler in the 1990s.
When a character string is transformed by the BWT, none of its characters change value. The transform cleverly rearranges the order of the characters in the string. If the original string had several substrings that occurred frequently, then the transformed string will have several places where a single character is repeated multiple times in a row. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as the move-to-front transform and run-length encoding.
Cyclic rotations
For 0 ≤ k ≤ n-1, the k-th cyclic rotation of the string w = w[0..n-1] is another string v = v[0..n-1], s.t., v[i] = w[(i+k) mod n].
In other words, if w = xy, where x has length k, then v = yx.
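The index formula v[i] = w[(i+k) mod n] translates directly into Python. The function name cyclic_rotation is an assumption made for this sketch.

```python
def cyclic_rotation(w, k):
    """Return the k-th cyclic rotation v of w, where v[i] = w[(i + k) mod n]."""
    n = len(w)
    return "".join(w[(i + k) % n] for i in range(n))


w = "babbabab"
print(cyclic_rotation(w, 1))                    # abbababb
print(cyclic_rotation(w, 3) == w[3:] + w[:3])   # True
```

The second line of the usage shows the equivalent slicing form: the first k symbols are moved to the end of the string.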
Burrows-Wheeler Transform
The Burrows-Wheeler Transform for the string w = w[0..n-1] is defined as follows:
Create a square matrix M[n x n] in which the k-th row contains the k-th cyclic rotation of w
Sort rows of M in lexicographic order
Store the string represented by the last column of M
and the index of the row which contains the original string w (i.e., the 0th cyclic rotation)
Burrows-Wheeler Transform (example)
Consider the 5th Fibonacci word f_5 = babbabab.
Matrix M of cyclic rotations (row [k] holds the k-th rotation):
[0] b a b b a b a b
[1] a b b a b a b b
[2] b b a b a b b a
[3] b a b a b b a b
[4] a b a b b a b b
[5] b a b b a b b a
[6] a b b a b b a b
[7] b b a b b a b a
BWT: rows of M after sorting (row number, rotation number, row content):
[0] [4] a b a b b a b b
[1] [1] a b b a b a b b
[2] [6] a b b a b b a b
[3] [3] b a b a b b a b
[4] [0] b a b b a b a b
[5] [5] b a b b a b b a
[6] [2] b b a b a b b a
[7] [7] b b a b b a b a
The output is the string bbbbbaaa and position [4].
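The definition can be followed literally in Python by sorting the rotation numbers by the rotations they denote. This is a sketch for non-empty strings only; the function name bwt is an assumption, and the approach takes quadratic space and more than linear time, unlike the suffix-array construction.

```python
def bwt(w):
    """Return (last column of the sorted rotation matrix, row index of w itself)."""
    n = len(w)
    # Sort rotation numbers k by the k-th cyclic rotation w[k:] + w[:k].
    order = sorted(range(n), key=lambda k: w[k:] + w[:k])
    # The last symbol of the k-th rotation is w[(k - 1) mod n].
    last_column = "".join(w[(k - 1) % n] for k in order)
    return last_column, order.index(0)


print(bwt("babbabab"))  # ('bbbbbaaa', 4)
```

The intermediate list order equals [4, 1, 6, 3, 0, 5, 2, 7] for this input, which matches the rotation numbers of the sorted rows in the example.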
Burrows-Wheeler Transform
The Burrows-Wheeler transform can be computed by the algorithm that constructs suffix arrays, which means that the Burrows-Wheeler transform can be computed in linear time.
The Burrows-Wheeler transform is reversible and the original string can be recovered efficiently via generation of consecutive columns of matrix M.
Burrows-Wheeler (reverse) Transform - hard way
Each step prepends the BWT column b b b b b a a a to the current rows and sorts the rows again:
Step 1, prepend: b b b b b a a a
Step 1, sort:    a a a b b b b b
Step 2, prepend: ba ba ba bb bb ab ab ab
Step 2, sort:    ab ab ab ba ba ba bb bb
Step 3, prepend: bab bab bab bba bba aba abb abb
Step 3, sort:    aba abb abb bab bab bab bba bba
Step 4, prepend: baba babb babb bbab bbab abab abba abba
Step 4, sort:    abab abba abba baba babb babb bbab bbab
Step 5, prepend: babab babba babba bbaba bbabb ababb abbab abbab
Step 5, sort:    ababb abbab abbab babab babba babba bbaba bbabb
Step 6, prepend: bababb babbab babbab bbabab bbabba ababba abbaba abbabb
Step 6, sort:    ababba abbaba abbabb bababb babbab babbab bbabab bbabba
Step 7, prepend: bababba babbaba babbabb bbababb bbabbab ababbab abbabab abbabba
Step 7, sort:    ababbab abbabab abbabba bababba babbaba babbabb bbababb bbabbab
Step 8, prepend: bababbab babbabab babbabba bbababba bbabbaba ababbabb abbababb abbabbab
Step 8, sort:    ababbabb abbababb abbabbab bababbab babbabab babbabba bbababba bbabbaba
After step 8 the row at position [4] is the original string babbabab.
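The column-by-column reconstruction can be sketched in Python as a loop that prepends the BWT column to the current rows and re-sorts them, n times in total. The function name inverse_bwt_naive is an assumption; the sketch is meant to mirror the table, not to be efficient.

```python
def inverse_bwt_naive(last_column, index):
    """Rebuild the sorted rotation matrix column by column and return row `index`."""
    n = len(last_column)
    rows = [""] * n
    for _ in range(n):
        # Prepend the BWT column to the current rows, then sort the rows again.
        rows = sorted(last_column[i] + rows[i] for i in range(n))
    return rows[index]


print(inverse_bwt_naive("bbbbbaaa", 4))  # babbabab
```

Each of the n rounds sorts n strings of growing length, which is why this is called the hard way.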
Burrows-Wheeler (reverse) Transform - easy way based on stable sorting property
[Figure: the BWT column b b b b b a a a drawn next to the 1st column a a a b b b b b, with corresponding symbols linked. Because the sorting is stable, the i-th occurrence of a symbol in the BWT column corresponds to the i-th occurrence of the same symbol in the 1st column.]
Just follow the cycle to obtain the reverse BWT: b a b b a b a b
Row:      0 1 2 3 4 5 6 7
Rotation: 4 1 6 3 0 5 2 7
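A minimal Python sketch of the cycle-following idea is given below. It relies on the fact that Python's sorted is stable, so the i-th occurrence of a symbol in the BWT column is matched with the i-th occurrence of the same symbol in the 1st column. The function name inverse_bwt and the variable names are assumptions made for illustration.

```python
def inverse_bwt(last_column, index):
    """Invert the BWT by following the cycle induced by a stable sort of the last column."""
    n = len(last_column)
    # order[j] is the position in the last column of the symbol that
    # appears at position j of the first (sorted) column.
    order = sorted(range(n), key=lambda i: last_column[i])
    result = []
    j = index
    for _ in range(n):
        j = order[j]                   # step to the row holding the next rotation
        result.append(last_column[j])  # its last symbol is the next symbol of w
    return "".join(result)


print(inverse_bwt("bbbbbaaa", 4))  # babbabab
```

For the example, order is [5, 6, 7, 0, 1, 2, 3, 4] and the cycle starting at row 4 visits the rows 1, 6, 3, 0, 5, 2, 7, 4. A single sort followed by n steps replaces the n rounds of sorting used in the hard way.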
Lempel-Ziv-Welch Compression
The Lempel-Ziv-Welch (LZW) compression algorithm is an example of dictionary based methods, in which longer fragments of the input text are replaced by much shorter references to code words stored in a special set called the dictionary.
LZW is an implementation of a lossless data compression algorithm developed by Abraham Lempel and Jacob Ziv. It was published by Terry Welch in 1984 as an improved version of the LZ78 dictionary coding algorithm developed by Lempel and Ziv.
26/04/2020 Applied Algorithmics - week7 16
LZW Compression
The key insight of the method is that it is possible to automatically build a dictionary of previously seen strings in the text being compressed. The dictionary starts off with 256 entries, one for each possible character (single byte string). Every time a string not already in the dictionary is seen, a longer string consisting of that string appended with the single character following it in the text, is stored in the dictionary.
LZW Compression
The output consists of integer indices into the dictionary. These initially are 9 bits each, and as the dictionary grows, can increase to up to 16 bits. A special symbol is reserved for "flush the dictionary" which takes the dictionary back to the original 256 entries, and 9 bit indices. This is useful if compressing a text which has variable characteristics, since a dictionary of early material is not of much use later in the text.
This use of variably increasing index sizes is one of Welch's contributions. Another was to specify an efficient data structure to store the dictionary.
LZW Compression - example
Fibonacci language: w_0 = a, w_1 = b, w_i = w_(i-1)·w_(i-2) for i > 1. For example, w_6 = babbababbabba.
We show how LZW compresses babbababbabba.
[Figure: the text babbababbabba with the code words CW_0, ..., CW_5 marked on it, preceded by a virtual part that holds the single symbols b (index -2) and a (index -1).]
In general: CW_i = CW_j · First(CW_(i+1)) and j < i.
And in particular: CW_4 = CW_3 · First(CW_5).
LZW Compression - example
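cw_-2 = b
cw_-1 = a
cw_0 = ba
cw_1 = ab
cw_2 = bb
cw_3 = bab
cw_4 = babb
cw_5 = babba
[Figure: the dictionary drawn as a trie. From the root, edge a leads to cw_-1 and edge b leads to cw_-2. From cw_-1, edge b leads to cw_1. From cw_-2, edge a leads to cw_0 and edge b leads to cw_2. From cw_0, edge b leads to cw_3; from cw_3, edge b leads to cw_4; from cw_4, edge a leads to cw_5.]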
LZW Compression - compression stage
cw ← empty string;
while ( read next symbol s from IN )
    if cw·s exists in the dictionary then
        cw ← cw·s;
    else
        add cw·s to the dictionary;
        save the index of cw in OUT;
        cw ← s;
if cw is not empty then
    save the index of cw in OUT;
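The compression loop can be sketched in Python as follows. Several details are assumptions made for illustration: the function name lzw_compress, the alphabet parameter (the initial single-symbol dictionary, indexed from 0 instead of the negative virtual indices used in the example), and the use of a Python dict as the hash table. Indices are returned as plain integers; packing them into 9 to 16 bits and flushing the dictionary are left out. The sketch assumes every input symbol belongs to the alphabet.

```python
def lzw_compress(text, alphabet):
    """Return the list of dictionary indices produced by LZW for text."""
    # Initial dictionary: one entry per single symbol of the alphabet.
    dictionary = {symbol: i for i, symbol in enumerate(alphabet)}
    output = []
    cw = ""
    for s in text:
        if cw + s in dictionary:
            cw += s                                # extend the current code word
        else:
            dictionary[cw + s] = len(dictionary)   # new entry gets the next free index
            output.append(dictionary[cw])          # emit the index of the longest match
            cw = s
    if cw:
        output.append(dictionary[cw])              # emit the final pending code word
    return output


print(lzw_compress("babbababbabba", "ab"))  # [1, 0, 1, 2, 5, 6, 0]
```

With a = 0 and b = 1, the new entries ba, ab, bb, bab, babb, babba receive the indices 2 to 7, so the output corresponds to the code words b, a, b, ba, bab, babb, a.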
Decompression stage
Input IN – Compressed file of integers.
Output OUT – Decompressed file of characters.
|IN| = Z – Size of the compressed file.
Copy all numbers from file IN to vector V[256..Z+255]
Create vector F[256..Z+255] containing first characters of each code word
Create vector CW[256..Z+255] of all code words
for i = 256 to Z+255 do
    if V[i] < 256 then
        CW[i] ← Concatenate(char(V[i]), F[i+1])
    else
        CW[i] ← Concatenate(CW[V[i]], F[i+1])
Write to the output file OUT all code words without their last symbols
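For comparison, here is a minimal Python sketch of LZW decompression. It does not use the vectors V, F and CW from the slide; it is the common incremental formulation, in which the dictionary is rebuilt while the codes are read. The function name lzw_decompress and the alphabet parameter (single symbols indexed from 0) are assumptions made for illustration.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the text from LZW indices, reconstructing the dictionary on the fly."""
    dictionary = {i: symbol for i, symbol in enumerate(alphabet)}
    if not codes:
        return ""
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # The code refers to the entry that is still being built:
            # it must be the previous string plus its own first symbol.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code: %d" % code)
        output.append(entry)
        # New code word: previous string followed by the first symbol of the current one.
        dictionary[len(dictionary)] = previous + entry[0]
        previous = entry
    return "".join(output)


print(lzw_decompress([1, 0, 1, 2, 5, 6, 0], "ab"))  # babbababbabba
```

The rule used to extend the dictionary is the relation CW_i = CW_j · First(CW_(i+1)): each new code word is a previously decoded string followed by the first symbol of the next one. Codes 5 and 6 in the usage example exercise the special case where the referenced entry is not yet complete.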
LZW text compression
Theorem: For any input string S, the LZW algorithm computes its compressed counterpart in time O(n), where n is the length of S.
Sketch of proof: The most complex operations are performed on the dictionary. With the help of hash tables all operations can be performed in linear time.
The decompression stage is also linear.