Transcript Slides

CMSC 100
Storing Data: Huffman Codes and Image Representation
Professor Marie desJardins
Tuesday, September 18, 2012
CMSC 100 -- Data Compression
Data Compression: Motivation

- Memory is a finite resource: the more data we have, the more space it takes to store
- Same with bandwidth: the more data we need to send, the more time it takes
- Data compression can reduce space and bandwidth
  - Lossless compression: store the exact same data in less space
  - Lossy compression: store an approximation of the data in less space
Time and Space Tradeoffs

- Data compression trades (computational) time for space and bandwidth:
  - It takes time to convert the original data D to the compressed format DC
  - It takes time to convert compressed data DC back to a viewable format D'
- Compression ratio: CR = Length(DC) / Length(D)
- Space savings: SS = 1 - CR
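The two definitions above can be sketched directly; the 25/100 figures below are made-up example sizes, not from the slides.

```python
def compression_ratio(compressed_len, original_len):
    # CR = Length(DC) / Length(D)
    return compressed_len / original_len

def space_savings(compressed_len, original_len):
    # SS = 1 - CR
    return 1 - compression_ratio(compressed_len, original_len)

# e.g., a 100 KB file compressed to 25 KB:
print(compression_ratio(25, 100))  # 0.25
print(space_savings(25, 100))      # 0.75
```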
Lossless vs. Lossy Compression

- Lossless: save space without losing any information
  - Take advantage of repetition and self-similarity (e.g., solid-color regions in an image)
- Lossy: save space but lose some information
  - Lose resolution or detail (e.g., "pixelate" an image or remove very high/low frequencies in a sound file)
Encoding Strategies

- Run-length encoding: replace n instances of object x with the pair of numbers (n, x)
- Frequency-dependent encoding: use shorter representations (fewer bits) for objects that appear more frequently in a document
- Relative or differential encoding: when x is followed by y, represent y by the difference y - x (which is often small in images etc. and can therefore be represented by a short code)
- Dictionary encoding: create an index of all of the objects (e.g., words) in a document, then replace each object with its index location (can save space if there is a lot of repetition)
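A minimal sketch of the first strategy, run-length encoding, applied to a string of characters:

```python
def run_length_encode(seq):
    """Replace each run of n consecutive copies of x with the pair (n, x)."""
    runs = []
    for x in seq:
        if runs and runs[-1][1] == x:
            runs[-1] = (runs[-1][0] + 1, x)  # extend the current run
        else:
            runs.append((1, x))              # start a new run
    return runs

print(run_length_encode("AAABBC"))  # [(3, 'A'), (2, 'B'), (1, 'C')]
```

This pays off when the data has long runs (e.g., solid-color regions in an image) and can expand the data when it does not.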
Image and Sound Formats

- Images
  - Row-by-row bitmaps in different color spaces:
    - RGB (one byte per color = 24 bits = 17M different colors), a.k.a. "True Color" (used in JPEG formats) (How much storage for one True Color 2Kx3K digital camera image?)
    - Color palette: use only one byte to index 256 of the 17M 24-bit colors (used in GIF formats) (How much storage for one 24-bit color 200x300 image on a website?)
  - Variable resolution provides different image sizes and levels of fidelity to an original (continuous or very high-resolution digital) image
- Sound
  - Convert continuous sound to digital by sampling (variable-rate)
  - Each sample can be represented with varying levels of resolution ("bit depth") (MP3: 44K samples/second, 16 bits/sample -- how much storage for one minute of sound?)
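Back-of-the-envelope answers to the three storage questions, under assumptions the slides leave open: "2K x 3K" means 2000 x 3000 pixels, the palette image adds a 256-entry table of 3-byte colors, and the sound is a single channel at the slide's 44K samples/second.

```python
# True Color camera image: 3 bytes (24 bits) per pixel
true_color = 2000 * 3000 * 3       # = 18,000,000 bytes, about 18 MB
print(true_color)

# Palette image: 1 byte per pixel plus the 256-color palette itself
palette = 200 * 300 * 1 + 256 * 3  # = 60,768 bytes, about 60 KB
print(palette)

# One minute of sound: 16 bits = 2 bytes per sample, 60 seconds
sound = 44_000 * 2 * 60            # = 5,280,000 bytes, about 5.3 MB
print(sound)
```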
Compression Ratio: Example

- Suppose I have a 2M .PNG (bitmap) image and I store it in a compressed .JPG file that is 187K. What is the compression ratio? What is the space savings?
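One possible worked answer, assuming "2M" means 2048 KB:

```python
original, compressed = 2048, 187  # sizes in KB
cr = compressed / original        # CR = Length(DC) / Length(D), about 0.091
ss = 1 - cr                       # SS = 1 - CR, about 0.909
print(round(cr, 3), round(ss, 3))
```

So the compressed file is roughly 9% of the original size, a space savings of roughly 91%.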
Huffman Coding

- Lossless frequency-based encoding
  - Huffman coding is (space-)optimal in the sense that if we know the exact distribution (frequency) of every object, we will be able to represent the document in the shortest possible number of bits
  - Downside: it takes a while to compute
- Goal #1: the length of each object's code should be related to its frequency
  - Specifically: length is proportional to the negative log of the frequency
- Goal #2: the code should be unambiguous
  - Since objects will be encoded at different lengths, as we read the bits, we need to know when we've reached the end of one object and should begin processing the next one
  - This type of code is called a prefix code
Using a Prefix Code

How would you represent "HELLO" using this code?

[Prefix tree figure, reconstructed here as a code table: H = 000, A = 001, L = 01, E = 10, O = 110, C = 1110, S = 1111]

Note: By convention, the left branch is 0; the right branch is 1
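A sketch of encoding with this table. The H, E, O, C, and S codes are confirmed by the decoding slides that follow; the A and L assignments are my reconstruction of the figure.

```python
codes = {"H": "000", "A": "001", "L": "01", "E": "10",
         "O": "110", "C": "1110", "S": "1111"}

def encode(text, codes):
    # Concatenate each symbol's code; no separators are needed
    # because the code is prefix-free.
    return "".join(codes[ch] for ch in text)

print(encode("HELLO", codes))  # 000 + 10 + 01 + 01 + 110
```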
Interpreting a Prefix Code

What does "1110000110110111110" mean in this code?

[Same prefix tree as above]

Interpreting a Prefix Code

What does "1110000110110111110" mean in this code? (Decoded so far: C)

Interpreting a Prefix Code

What does "1110 | 000110110111110" mean in this code? (Decoded so far: C)

Interpreting a Prefix Code

What does "1110 | 000110110111110" mean in this code? (Decoded so far: C, H)

Interpreting a Prefix Code

What does "1110 | 000 | 110 | 110 | 1111 | 10" mean in this code? (Answer: CHOOSE)
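The step-by-step decoding above can be sketched as a single pass over the bits, using the same reconstructed code table:

```python
codes = {"H": "000", "A": "001", "L": "01", "E": "10",
         "O": "110", "C": "1110", "S": "1111"}

def decode(bits, codes):
    """Walk the bit string, emitting a symbol whenever the accumulated
    prefix matches a code -- prefix-freeness makes this unambiguous."""
    by_code = {c: s for s, c in codes.items()}
    out, prefix = [], ""
    for b in bits:
        prefix += b
        if prefix in by_code:
            out.append(by_code[prefix])
            prefix = ""
    return "".join(out)

print(decode("1110000110110111110", codes))  # CHOOSE
```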
Decode the Message:

[Prefix tree figure; leaves include L, A, <SPC>, T, O, Y, W, E, !, C, M, P, S, U, R; left branch = 0, right branch = 1]

0111110010100101011011100011110111110110 010 00111111110 010
0110001110 010 0110001110 010 0110001110 010
0001100000100100000000110 010 011111001000000 01110
Encoding Algorithm

- Frequency distribution:
  - Set of k objects, o1...ok
  - Number of times each object appears in the document, n1...nk
- Construct a Huffman code as follows:
  1. Pick the two least frequent objects, oi and oj
  2. Replace them with a single combined object, oij, with frequency ni+nj
  3. If there are at least two objects left, go to step 1
- Visually:
  - Each of the original objects is a leaf (bottom node) in the prefix tree
  - Each combined object represents a 0/1 split where the "children" are the two objects that were combined
  - In the last step, we combine two subtrees into a single final prefix tree
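A minimal sketch of the three steps above, using a binary heap to pick the two least frequent objects; the frequencies are those of the SEASHELLS example on the following slides. (The particular tree shape can differ from the slides when frequencies tie, but any Huffman tree has the same total encoded length.)

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code from {symbol: frequency} by repeatedly
    merging the two least frequent nodes."""
    tie = count()  # tie-breaker so heapq never has to compare trees
    heap = [(n, next(tie), sym) for sym, n in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)   # step 1: two least frequent
        n2, _, right = heapq.heappop(heap)
        # step 2: combined object with frequency n1 + n2
        heapq.heappush(heap, (n1 + n2, next(tie), (left, right)))
    # Walk the final tree: left branch = 0, right branch = 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = {"A": 2, "B": 1, "E": 7, "H": 4, "L": 4, "O": 1,
         "R": 1, "S": 8, "T": 1, "Y": 1, " ": 5}
codes = huffman_code(freqs)
print(codes)
```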
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution:
  - A – 2
  - B – 1
  - E – 7
  - H – 4
  - L – 4
  - O – 1
  - R – 1
  - S – 8
  - T – 1
  - Y – 1
  - <SPC> – 5
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution as above

[Tree so far: 2 = B1 + O1]
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution as above

[Tree so far: 2 = B1 + O1; 2 = R1 + T1; 3 = A2 + Y1]
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution as above

[Tree so far: 2 = B1 + O1; 2 = R1 + T1; 3 = A2 + Y1; 4 = 2 + 2; 7 = 4 + 3]
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution as above

[Tree so far: 2 = B1 + O1; 2 = R1 + T1; 3 = A2 + Y1; 4 = 2 + 2; 7 = 4 + 3; 8 = H4 + L4]
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution as above

[Tree so far: 2 = B1 + O1; 2 = R1 + T1; 3 = A2 + Y1; 4 = 2 + 2; 7 = 4 + 3; 8 = H4 + L4; 12 = <SPC>5 + 7]
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution as above

[Tree so far: 2 = B1 + O1; 2 = R1 + T1; 3 = A2 + Y1; 4 = 2 + 2; 7 = 4 + 3; 8 = H4 + L4; 12 = <SPC>5 + 7; 15 = E7 + 8]
Encoding Example

- SHE SELLS SEASHELLS BY THE SEASHORE
- Frequency distribution as above

[Final tree, root 35: 35 = 15 + 20; 20 = 12 + S8; 15 = E7 + 8; 12 = <SPC>5 + 7; 8 = H4 + L4; 7 = 4 + 3; 4 = 2 + 2; 3 = A2 + Y1; 2 = B1 + O1; 2 = R1 + T1]
Green Eggs and Ham

Green Eggs and Ham
Symbols (not letters!) are words.
Ignore spaces and punctuation.

I am Sam
I am Sam
Sam I am

That Sam-I-am!
That Sam-I-am!
I do not like
that Sam-I-am!

Do you like
green eggs and ham?

I do not like them,
Sam-I-am.
I do not like
green eggs and ham.
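A sketch of the word-symbol frequency count this exercise calls for, assuming "Sam-I-am" counts as a single symbol and that case, spaces, and punctuation are ignored:

```python
from collections import Counter

poem = """I am Sam
I am Sam
Sam I am
That Sam-I-am!
That Sam-I-am!
I do not like
that Sam-I-am!
Do you like
green eggs and ham?
I do not like them,
Sam-I-am.
I do not like
green eggs and ham."""

# Split on whitespace, strip punctuation, fold case; hyphens are kept
# so "Sam-I-am" stays one symbol.
words = [w.strip(".,?!").lower() for w in poem.split()]
counts = Counter(words)
print(counts.most_common(3))
```

These word frequencies would then feed the same merge-two-least-frequent construction used for letters on the earlier slides.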