
Lecture 4: Lossless Compression (1)
Hongli Luo
Fall 2011
Lecture 4: Lossless Compression (1)
 Topics (Chapter 7)
 Introduction
 Basics of Information Theory
 Compression techniques
• Lossless compression
• Lossy compression
4.1 Introduction
 Compression: the process of coding that will
effectively reduce the total number of bits needed to
represent certain information.
Introduction
 Compression ratio:
   compression ratio = B0 / B1
 B0 - number of bits before compression
 B1 - number of bits after compression
Types of Compression
 Lossless compression
 Does not lose information – the signal can be perfectly
reconstructed after decompression
 Produces a variable bit-rate
 It is not guaranteed to actually reduce the data size
• Depends on the characteristics of the data
 Example: WinZip
 Lossy compression
 Loses some information – the signal is not perfectly
reconstructed after decompression
 Can produce any desired constant bit-rate
 Examples: JPEG, MPEG
4.2 Basics of Information Theory
 Model Information at the Source
 Model data at the source as a stream of symbols – this defines the “vocabulary” of the source.
 Each symbol in the vocabulary is represented by bits.
 If the vocabulary has N symbols, each symbol is represented with log2 N bits (a quick check in code follows below).
• Text by ASCII code – 8 bits/code: N = 2^8 = 256 symbols
• Speech – 16 bits/sample: N = 2^16 = 65,536 symbols
• Color image – 3×8 bits/sample: N = 2^24 ≈ 17×10^6 symbols
• 8×8 image blocks – 8×64 bits/block: N = 2^512 ≈ 10^154 symbols
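As a quick check of the arithmetic above, the following Python snippet (added here for illustration, not part of the original slides) computes log2 N for each of the listed vocabularies:

```python
import math

# Bits needed per symbol for a vocabulary of N symbols: log2(N).
# Vocabulary sizes taken from the slide's examples.
examples = {
    "ASCII text (8 bits/code)":      2**8,
    "Speech (16 bits/sample)":       2**16,
    "Color image (3x8 bits/sample)": 2**24,
    "8x8 image block (8x64 bits)":   2**512,
}

for name, N in examples.items():
    print(f"{name}: log2(N) = {math.log2(N):.0f} bits/symbol")
```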
Lossless Compression
 Lossless compression techniques ensure no loss of data after
compression/decompression.
 Coding: “Translate” each symbol in the vocabulary into a “binary
codeword”. Codewords may have different binary lengths.
 Example: You have 4 symbols (a, b, c, d). Each can be represented in binary using 2 bits, but coded using a different number of bits:
• a (00) -> 000
• b (01) -> 001
• c (10) -> 01
• d (11) -> 1
 The goal of coding is to minimize the average symbol length (a short encoding sketch follows).
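To make the goal concrete, here is a minimal sketch (my own illustration, using an assumed symbol stream) that encodes symbols with the codebook above and compares the result against the fixed 2-bit representation:

```python
# Variable-length codebook from the slide's example (a prefix code).
codebook = {"a": "000", "b": "001", "c": "01", "d": "1"}

# Hypothetical symbol stream; 'd' occurs most often here.
stream = "ddddccbaad"

coded = "".join(codebook[s] for s in stream)
fixed_bits = 2 * len(stream)            # 2 bits per symbol, fixed-length
print(f"coded bits: {len(coded)}, fixed-length bits: {fixed_bits}")
print(f"average symbol length: {len(coded) / len(stream):.2f} bits/symbol")
```

Because the most frequent symbol (d) gets the shortest codeword, the coded stream uses fewer bits (18) than the fixed-length version (20).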
Average Symbol Length
 The vocabulary of the source has N symbols
 l(i) – binary length of the ith symbol
 Symbol i has been emitted m(i) times
 M – total number of symbols that the source emits (every T seconds)
 Number of bits emitted in T seconds: Σi m(i)·l(i)
 Probability P(i) of a symbol: the fraction of times it occurs in the transmission, defined as P(i) = m(i)/M
Average Symbol Length
 Average length per symbol (average symbol length): L = Σi P(i)·l(i) bits/symbol
 Average bit rate: M·L / T bits per second (see the sketch below)
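A small sketch of these formulas in Python, using assumed counts m(i) and codeword lengths l(i) (the numbers are illustrative, not from the lecture):

```python
# Assumed counts m(i) and codeword lengths l(i) for a 4-symbol vocabulary.
m = {"a": 2, "b": 1, "c": 2, "d": 5}      # times each symbol was emitted
l = {"a": 3, "b": 3, "c": 2, "d": 1}      # codeword length of each symbol, in bits
T = 1.0                                   # observation window in seconds (assumed)

M = sum(m.values())                       # total symbols emitted in T seconds
total_bits = sum(m[s] * l[s] for s in m)  # bits emitted in T seconds
P = {s: m[s] / M for s in m}              # symbol probabilities P(i) = m(i)/M
avg_len = sum(P[s] * l[s] for s in m)     # average symbol length L, bits/symbol
bit_rate = M * avg_len / T                # average bit rate, bits/second

print(f"M = {M}, total bits = {total_bits}")
print(f"average symbol length = {avg_len:.2f} bits/symbol")
print(f"average bit rate = {bit_rate:.1f} bits/s")
```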
Minimum Average Symbol Length
 Goal of compression
 To minimize the number of bits being transmitted
 Equivalent to minimizing the average symbol length
 How to reduce the average symbol length
 Assign shorter codewords to symbols that appear more
frequently,
 Assign longer codewords to symbols that appear less
frequently
Minimum Average Symbol Length
 What is the lower bound on the average symbol length?
 It is determined by the entropy
 Shannon’s source coding theorem
 The average binary length of the encoded symbols is always greater than or equal to the entropy H of the source
Entropy
 The entropy η of an information source with alphabet S = {s1, s2, …, sn} is:
   η = H(S) = Σi pi log2(1/pi) = − Σi pi log2 pi
 pi – probability that symbol si will occur in S.
 log2(1/pi) indicates the amount of information (self-information as defined by Shannon) contained in si, which corresponds to the number of bits needed to encode si.
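To make the formula concrete, here is a minimal Python sketch (not from the slides) that evaluates the entropy of an assumed probability distribution, including the uniform case discussed later:

```python
import math

def entropy(probs):
    """Entropy in bits: sum of p * log2(1/p) over symbols with p > 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Assumed distribution: one frequent symbol, three rarer ones.
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits/symbol

# Uniform distribution over 256 gray levels (pi = 1/256): entropy = 8 bits.
print(entropy([1 / 256] * 256))             # 8.0 bits/symbol
```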
Entropy
 The entropy is a characteristic of a given source of symbols
 Entropy is largest (equal to log2 N) when all symbols are equally probable
• i.e., the chances that each symbol appears are similar – the symbols are uniformly distributed in the source
 Entropy is small (but always ≥ 0) when some symbols are much more likely to appear than others
Entropy and code length
 The entropy η represents the average amount of information contained per symbol in the source S.
 The entropy η specifies the lower bound on the average number of bits needed to code each symbol in S, i.e., η ≤ L, where L is the average length (measured in bits) of the codewords produced by the encoder.
 Efficiency of the encoder: η / L (see the sketch below)
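Continuing in the same spirit, a short sketch (assumed probabilities and codeword lengths, not from the lecture) that evaluates the encoder efficiency η / L:

```python
import math

# Assumed symbol probabilities and codeword lengths (illustrative values).
P = {"a": 0.125, "b": 0.125, "c": 0.25, "d": 0.5}
l = {"a": 3, "b": 3, "c": 2, "d": 1}

eta = sum(p * math.log2(1.0 / p) for p in P.values())   # entropy, bits/symbol
avg_len = sum(P[s] * l[s] for s in P)                    # average codeword length L
print(f"entropy = {eta:.2f} bits, average length = {avg_len:.2f} bits")
print(f"efficiency = {eta / avg_len:.0%}")               # here the code meets the bound: 100%
```

For this particular choice the codeword lengths equal log2(1/pi), so the code reaches the entropy bound exactly; in general the efficiency is at most 100%.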

Distribution of Gray-Level Intensities
 Fig. 7.2(a) shows the histogram of an image with a uniform distribution of gray-level intensities, i.e., pi = 1/256.
 Hence, the entropy of this image is: η = log2 256 = 8    (7.4)
 No compression is possible for this image!
4.3 Compression Techniques
 Compression techniques are broadly classified into
 Lossless compression
• Run-length encoding
• Variable length coding (entropy coding):
– Shannon-Fano algorithm
– Huffman coding
– Adaptive Huffman coding
• Arithmetic coding
• LZW

Lossy compression
Run-length Encoding
 A sequence of elements c1, c2, …, ci, … is mapped to runs (ci, li)
• ci = symbol
• li = length of symbol ci's run
 For example, given the sequence of symbols {1,1,1,1,3,3,4,4,4,3,3,5,2,2,2}, the run-length encoding is (1,4),(3,2),(4,3),(3,2),(5,1),(2,3)
 Apply run-length encoding to a bi-level image (with only 1-bit black and white pixels)
 Assume the starting run is of a particular color (either black or white)
 Code the length of each run (a sketch in code follows)
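Here is a minimal run-length encoder/decoder sketch (my illustration, not code from the lecture), applied to the example sequence above:

```python
from itertools import groupby

def rle_encode(seq):
    """Map a sequence of symbols to (symbol, run_length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

def rle_decode(runs):
    """Expand (symbol, run_length) pairs back into the original sequence."""
    return [sym for sym, length in runs for _ in range(length)]

seq = [1, 1, 1, 1, 3, 3, 4, 4, 4, 3, 3, 5, 2, 2, 2]
runs = rle_encode(seq)
print(runs)                          # [(1, 4), (3, 2), (4, 3), (3, 2), (5, 1), (2, 3)]
assert rle_decode(runs) == seq       # lossless: decoding recovers the input
```

For a bi-level image, once the starting color is known only the run lengths need to be coded, since the colors of successive runs alternate.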
Variable Length Coding
 VLC generates variable-length codewords from fixed-length symbols
 VLC is one of the best-known entropy coding methods
 Methods of VLC:
• Shannon-Fano algorithm
• Huffman coding
• Adaptive Huffman coding
Shannon-Fano Algorithm
A top-down approach, Steps:
1. Sort the symbols according to the frequency count of their
occurrences.
2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts contain
only one symbol.
An Example: Coding of “HELLO”
 Sort the symbols according to their frequencies: L, H, E, O
 Assign bit 0 to the left branches and 1 to the right branches.
 Coded bits: 10 bits
 Raw data, 8 bits per character: 5 × 8 = 40 bits
 Compressed size: 10/40 = 25% of the original (compression ratio B0/B1 = 40/10 = 4), reproduced in the sketch below
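The “HELLO” example can be reproduced with a short recursive implementation of the top-down procedure described above. The sketch below is my own illustration (sort by frequency, then repeatedly split into two parts with roughly equal counts), not code from the lecture:

```python
from collections import Counter

def shannon_fano(freqs):
    """Top-down Shannon-Fano coding: sort symbols by frequency, then recursively
    split the list into two parts with (approximately) equal total counts,
    appending 0 for the left part and 1 for the right part."""
    codes = {}

    def split(items, prefix):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"   # a lone symbol still needs one bit
            return
        total = sum(count for _, count in items)
        # Pick the cut whose left/right count totals are as balanced as possible.
        best_cut, best_diff, running = 1, None, 0
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(2 * running - total)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        split(items[:best_cut], prefix + "0")
        split(items[best_cut:], prefix + "1")

    ordered = sorted(freqs.items(), key=lambda kv: -kv[1])
    split(ordered, "")
    return codes

freqs = Counter("HELLO")                 # counts: L=2, H=1, E=1, O=1
codes = shannon_fano(freqs)
print(codes)                             # {'L': '0', 'H': '10', 'E': '110', 'O': '111'}
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(total_bits, "coded bits vs", 8 * len("HELLO"), "raw bits")   # 10 vs 40
```

Depending on how ties in the split are broken, a different but still 10-bit code can result for “HELLO”; Shannon-Fano codes are not unique.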