Transcript Document

Introduction to
Information theory
A.J. Han Vinck
University of Duisburg-Essen
April 2012
Content

Introduction

Entropy and some related properties

Source coding

Channel coding
First lecture






What is information theory about?
Entropy, or shortest average representation length
Some properties of entropy
Mutual information
Data processing theorem
Fano inequality
Field of Interest
Information theory deals with the problem of
efficient and reliable transmission of information
It specifically encompasses theoretical and applied aspects of
- coding, communications and communications networks
- complexity and cryptography
- detection and estimation
- learning, Shannon theory, and stochastic processes
Some of the successes of IT
• Satellite communications:
Reed Solomon Codes (also CD-Player)
Viterbi Algorithm
• Public Key Cryptosystems (Diffie-Hellman)
• Compression Algorithms
Huffman, Lempel-Ziv, MP3, JPEG, MPEG
• Modem Design with Coded Modulation ( Ungerböck )
• Codes for Recording ( CD, DVD )
OUR Definition of Information
Information is knowledge that can be used
i.e. data is not necessarily information
we:
1) specify a set of messages of interest to a receiver
2) and select a message to be transmitted
3) sender and receiver build a pair
Communication model
[Figure: communication model: source → analogue-to-digital conversion (if the source is not already digital) → compression/reduction → security → error protection → from bit to signal]
A generator of messages: the discrete source
source X
Output x ∈ { finite set of messages }
Example:
binary source: x ∈ { 0, 1 } with P( x = 0 ) = p; P( x = 1 ) = 1 - p
M-ary source: x ∈ { 1, 2, …, M } with Σ Pi = 1.
Express everything in bits: 0 and 1
Discrete finite ensemble:
a, b, c, d → 00, 01, 10, 11
in general: k binary digits specify 2^k messages
M messages need ⌈log2 M⌉ bits
Analogue signal: (problem is sampling speed)
1) sample and 2) represent sample value binary
[Figure: an analogue signal v(t) is sampled and each sample value is represented by one of the binary levels 00, 01, 10, 11]
Output: 00, 10, 01, 01, 11
The entropy of a source
a fundamental quantity in Information theory
entropy
The minimum average number of binary digits needed to specify a source output (message) uniquely is called the "SOURCE ENTROPY".
SHANNON (1948):
1) Source entropy := -Σ_{i=1..M} P(i) log2 P(i) ≤ Σ_{i=1..M} P(i) ℓ(i) = L,
   where ℓ(i) is the length (in binary digits) of the representation of message i and L is the average representation length.
2) The minimum can be obtained!
QUESTION: how to represent a source output in digital form?
QUESTION: what is the source entropy of text, music, pictures?
QUESTION: are there algorithms that achieve this entropy?
http://www.youtube.com/watch?v=z7bVw7lMtUg
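To make the definition concrete, here is a small Python sketch (not from the slides; the probabilities are made-up examples) that evaluates -Σ P(i) log2 P(i):

```python
import math

def source_entropy(probs):
    """Source entropy H = -sum_i P(i) * log2 P(i), in bits per source output."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative M-ary source with M = 4 messages (probabilities are made up)
probs = [0.4, 0.3, 0.2, 0.1]
print(source_entropy(probs))       # ~1.85 bits, less than log2(4) = 2 bits
print(source_entropy([0.25] * 4))  # 2.0 bits: a uniform source needs the full log2 M
```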
Properties of entropy
A: For a source X with M different outputs:
log2 M ≥ H(X) ≥ 0
the „worst“ we can do is
just assign log2M bits to each source output
B: For a source X „related“ to a source Y:
H(X) ≥ H(X|Y)
Y gives additional info about X
when X and Y are independent, H(X) = H(X|Y)
Joint Entropy: H(X,Y) = H(X) + H(Y|X)
also H(X,Y) = H(Y) + H(X|Y)
intuition: first describe Y and then X given Y
from this: H(X) – H(X|Y) = H(Y) – H(Y|X)
Homework: check the formula
Cont.

As a formula:
H(X,Y) = -Σ_{X,Y} P(x,y) log P(x,y) = -Σ_{X,Y} P(x,y) log P(x)P(y|x)
       = -Σ_X Σ_Y P(x)P(y|x) log P(x) - Σ_{X,Y} P(x,y) log P(y|x) = H(X) + H(Y|X)

H(X|Y) = -Σ_{X,Y} P(x,y) log P(x|y)
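As a sanity check of the chain rule H(X,Y) = H(X) + H(Y|X), a short Python sketch with an arbitrary, illustrative joint distribution:

```python
import math

# Arbitrary joint distribution P(x, y) for illustration (rows = x, columns = y); sums to 1.
P = [[0.3, 0.1],
     [0.2, 0.4]]

def H(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

Px = [sum(row) for row in P]                              # marginal P(x)
H_XY = H([P[x][y] for x in range(2) for y in range(2)])
H_X = H(Px)
# H(Y|X) = sum_x P(x) * H(Y | X = x)
H_Y_given_X = sum(Px[x] * H([P[x][y] / Px[x] for y in range(2)]) for x in range(2))

print(H_XY, H_X + H_Y_given_X)  # both ~1.85, confirming H(X,Y) = H(X) + H(Y|X)
```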
Entropy: Proof of A
We use the following important inequalities
log2 M = ln M · log2 e   (log2 x = ln x · log2 e; ln x = y ⇔ x = e^y)
1 - 1/M ≤ ln M ≤ M - 1
[Figure: the curves M - 1, ln M and 1 - 1/M plotted against M]
Homework: draw the inequality
Entropy: Proof of A
H(X) - log2 M = log2 e · Σ_x P(x) ln ( 1 / (M·P(x)) )
             ≤ log2 e · Σ_x P(x) ( 1 / (M·P(x)) - 1 )
             = log2 e · ( Σ_x 1/M - 1 ) = 0
Entropy: Proof of B
H(X) - H(X|Y) = -log2 e · Σ_{x,y} P(x,y) ln ( P(x) / P(x|y) )
             ≥ -log2 e · Σ_{x,y} P(x,y) ( P(x) / P(x|y) - 1 )
             = 0
The connection between X and Y
[Figure: channel diagram with inputs X = 0, 1, …, M-1 (probabilities P(X=0), …, P(X=M-1)) and outputs Y = 0, 1, …, N-1, connected by the transition probabilities P(Y=j|X=i)]

P(X = i, Y = j) = P(X = i) P(Y = j | X = i) = P(Y = j) P(X = i | Y = j)   (Bayes)

P(Y = j) = Σ_{i=0..M-1} P(X = i) P(Y = j | X = i) = Σ_{i=0..M-1} P(X = i, Y = j)
Entropy: corollary
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X,Y,Z) = H(X) + H(Y|X) + H(Z|XY) ≤ H(X) + H(Y) + H(Z)
Binary entropy
lim_{n→∞} (1/n) log2 (n choose pn) = h(p),  i.e.  (n choose pn) ≈ 2^{n·h(p)}
interpretation: let a binary sequence contain pn ones; then we can specify each sequence with log2 2^{n·h(p)} = n·h(p) bits
Homework: Prove the approximation using ln N! ~ N lnN for N large.
Use also: log_a x = y ⇒ log_b x = y · log_b a
The Stirling approximation: N! ≈ √(2πN) · N^N · e^(-N)
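A quick numerical look at this approximation (my own sketch, not part of the slides): (1/n) log2 of the binomial coefficient approaches h(p) as n grows.

```python
import math

def h(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.25
for n in (100, 1_000, 10_000):
    k = int(p * n)  # number of ones in the sequence
    approx = math.log2(math.comb(n, k)) / n
    print(n, approx, h(p))  # (1/n) log2 C(n, pn) approaches h(p) as n grows
```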
The Binary Entropy:
h(p) = -p log2 p - (1-p) log2 (1-p)
Note: h(p) = h(1-p)
[Figure: plot of the binary entropy h(p) versus p for 0 ≤ p ≤ 1; h(0) = h(1) = 0, maximum h(½) = 1]
Homework:
Consider the following figure.
[Figure: a set of points in the (X, Y) plane; the axes show values X = 1, 2, 3 and Y = 0, 1, 2, 3]
All points are equally likely. Calculate H(X), H(X|Y) and H(X,Y).
Source coding
Two principles:
data reduction:
remove irrelevant data (lossy, gives errors)
data compression:
present data in compact (short) way (lossless)
[Figure: transmitter side: original data → remove irrelevance → relevant data → compact description; receiver side: "unpack" → "original data"]
Shannon's (1948) definition of transmission of information:
reproducing at one point (in time or space), either exactly or approximately, a message selected at another point.
Shannon uses:
Binary Information digiTS (BITS): 0 or 1
n bits specify M = 2^n different messages
OR
M messages are specified by n = ⌈log2 M⌉ bits
Example: fixed length representation
00000 ↔ a
00001 ↔ b
…
11001 ↔ y
11010 ↔ z
- the alphabet: 26 letters, ⌈log2 26⌉ = 5 bits
- ASCII: 7 bits represent 128 characters
ASCII Table to transform our letters and signs into binary ( 7 bits = 128 messages)
ASCII stands for American Standard Code for Information Interchange
Example:
- suppose we have a dictionary with 30,000 words
- these can be numbered (encoded) with 15 bits
- if the average word length is 5, we need "on the average" 3 bits per letter
another example
Source: output a, b, or c.

First code (symbol by symbol):
In    Out
a     00
b     01
c     10
Efficiency = 2 bits/output symbol

Improve efficiency: translate blocks of output symbols into binary:
In    Out
aaa   00000
aab   00001
aba   00010
…     …
ccc   11010
Efficiency = 5/3 bits/output symbol

Homework: calculate the optimum efficiency.
Source coding (Morse idea)
Example: A system generates
the symbols X, Y, Z, T
with probability P(X) = ½; P(Y) = ¼; P(Z) = P(T) = 1/8
Source encoder: X → 0; Y → 10; Z → 110; T → 111
Average transmission length = ½ × 1 + ¼ × 2 + 2 × ⅛ × 3 = 1¾ bits/symbol.
A naive approach gives X → 00; Y → 10; Z → 11; T → 01,
with average transmission length 2 bits/symbol.
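A small check of this example (illustrative sketch): the average length of the variable-length code meets the source entropy, while the naive fixed-length code does not.

```python
import math

# Probabilities and codeword lengths from the example: symbols X, Y, Z, T
probs   = [1/2, 1/4, 1/8, 1/8]
lengths = [1, 2, 3, 3]        # code 0, 10, 110, 111
naive   = [2, 2, 2, 2]        # code 00, 10, 11, 01

entropy = -sum(p * math.log2(p) for p in probs)
L_var   = sum(p * l for p, l in zip(probs, lengths))
L_fixed = sum(p * l for p, l in zip(probs, naive))

print(entropy, L_var, L_fixed)  # 1.75, 1.75, 2.0: the variable-length code meets the entropy
```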
Example: variable length representation of messages
letter   P(*) (frequency of occurrence)   C1   C2
e        0.5                              00   1
a        0.25                             01   01
x        0.125                            10   000
q        0.125                            11   001

Example (code C2): 0111001101000…  ↔  aeeqea…
Note: C2 is uniquely decodable! (check!)
Efficiency of C1 and C2

Average number of coding symbols of C1:
L = 2×0.5 + 2×0.25 + 2×0.125 + 2×0.125 = 2
Average number of coding symbols of C2:
L = 1×0.5 + 2×0.25 + 3×0.125 + 3×0.125 = 1.75
C2 is more efficient than C1
Source coding theorem

Shannon shows that source coding algorithms exist that have a uniquely decodable representation whose average length approaches the entropy of the source.
We cannot do with less.
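One family of algorithms that approaches this bound is Huffman coding, mentioned earlier among the successes of IT. Below is a minimal, illustrative Huffman sketch in Python; the symbol probabilities are taken from the Morse-idea example above and are not prescribed here.

```python
import heapq
import itertools

def huffman_code(probs):
    """Return a dict symbol -> binary codeword for the given symbol probabilities."""
    counter = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(p, next(counter), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)
        p1, _, code1 = heapq.heappop(heap)
        # Prefix the two cheapest subtrees with 0 and 1 and merge them
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

probs = {"X": 0.5, "Y": 0.25, "Z": 0.125, "T": 0.125}
code = huffman_code(probs)
print(code)  # e.g. {'X': '0', 'Y': '10', 'Z': '110', 'T': '111'} (up to 0/1 relabeling)
avg = sum(probs[s] * len(c) for s, c in code.items())
print(avg)   # 1.75 bits/symbol = the source entropy
```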
Basic idea of cryptography
http://www.youtube.com/watch?v=WJnzkXMk7is
[Figure: sender: open message → secret operation → closed cryptogram (send); receiver: (receive) closed cryptogram → secret operation → open message]
Source coding in Message encryption (1)
The message consists of Part 1, Part 2, …, Part n (for example, every part 56 bits); dependency exists between the parts of the message.
Each part is enciphered separately with the key, giving n cryptograms; dependency exists between the cryptograms.
The receiver deciphers each cryptogram with the key back into Part 1, Part 2, …, Part n.
Attacker: n cryptograms to analyze for a particular message of n parts.
Source coding in Message encryption (2)
Part 1, Part 2, …, Part n (for example, every part 56 bits) → source encode (n-to-1) → encipher with the key → 1 cryptogram → decipher with the key → source decode → Part 1, Part 2, …, Part n.
Attacker:
- 1 cryptogram to analyze for a particular message of n parts
- assume a data compression factor of n-to-1
Hence, less material for the same message!
Transmission of information





Mutual information definition
Capacity
Idea of error correction
Information processing
Fano inequality
Mutual information I(X;Y):
I(X;Y) := H(X) – H(X|Y)
        = H(Y) – H(Y|X)   (homework: show this!)
i.e. the reduction in the description length of X given Y,
or: the amount of information that Y gives about X.
Note that I(X;Y) ≥ 0.
similarly:
I(X;Y|Z) = H(X|Z) – H(X|YZ)
the amount of information that Y gives about X given Z
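A sketch of how I(X;Y) can be computed from a joint distribution (my own illustration; the joint probabilities are arbitrary):

```python
import math

def H(probs):
    """Entropy in bits of a probability list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary joint distribution P(x, y); rows = x, columns = y
P = [[0.4, 0.1],
     [0.1, 0.4]]

Px = [sum(row) for row in P]
Py = [sum(P[x][y] for x in range(len(P))) for y in range(len(P[0]))]
H_X = H(Px)
H_XY = H([p for row in P for p in row])
H_X_given_Y = H_XY - H(Py)      # H(X|Y) = H(X,Y) - H(Y)
I = H_X - H_X_given_Y           # I(X;Y) = H(X) - H(X|Y)
print(I)  # ~0.278 bits; it would be 0 if X and Y were independent
```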
3 classical channels
[Figure: three channel diagrams, each with input X and output Y]
- Binary symmetric channel (satellite): X ∈ {0,1} → Y ∈ {0,1}; each input can be received as the other symbol
- Erasure channel (network): X ∈ {0,1} → Y ∈ {0, E, 1}; each input can be erased to E
- Z-channel (optical): X ∈ {0,1} → Y ∈ {0,1}; only one of the two inputs can be received in error
Homework:
find maximum H(X)-H(X|Y) and the corresponding input distribution
Example 1

Suppose that X ∈ { 000, 001, …, 111 } with H(X) = 3 bits

Channel: X → channel → Y = parity of X
H(X|Y) = 2 bits: we transmitted H(X) – H(X|Y) = 1 bit of information!
We know that X|Y ∈ { 000, 011, 101, 110 } or X|Y ∈ { 001, 010, 100, 111 }
Homework: suppose the channel output gives the number of ones in X. What is then H(X) – H(X|Y)?
Transmission efficiency
Example: Erasure channel
[Figure: binary erasure channel with inputs 0 and 1, each with probability ½; an input is received correctly with probability 1-e and erased to E with probability e. Output probabilities: P(Y=0) = P(Y=1) = (1-e)/2, P(Y=E) = e]
H(X) = 1
H(X|Y) = e
H(X) - H(X|Y) = 1 - e = maximum!
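A numerical sanity check of this example (illustrative sketch): build the joint distribution of the erasure channel with uniform input, for an arbitrary erasure probability e, and verify H(X|Y) = e.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

e = 0.3  # erasure probability (arbitrary choice)
# Joint P(x, y) for x in {0, 1} and y in {0, E, 1}, with P(X=0) = P(X=1) = 1/2
P = {(0, '0'): 0.5 * (1 - e), (0, 'E'): 0.5 * e, (0, '1'): 0.0,
     (1, '0'): 0.0,           (1, 'E'): 0.5 * e, (1, '1'): 0.5 * (1 - e)}

H_X  = H([0.5, 0.5])
H_Y  = H([sum(p for (x, y), p in P.items() if y == s) for s in ('0', 'E', '1')])
H_XY = H(list(P.values()))
H_X_given_Y = H_XY - H_Y
print(H_X_given_Y, e)            # both 0.3: H(X|Y) = e
print(H_X - H_X_given_Y, 1 - e)  # both 0.7: transmitted information = 1 - e
```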
Example 2



Suppose we have 2^n messages, specified by n bits.
[Figure: each bit is transmitted over the binary erasure channel: 0 → 0 and 1 → 1 with probability 1-e, 0 → E and 1 → E with probability e]
After n transmissions we are left with ne erasures (on average).
Thus: the number of messages we cannot specify = 2^{ne}.
We transmitted n(1-e) bits of information over the channel!
Transmission efficiency
Easily obtainable when there is feedback!
[Figure: input 0,1 → erasure channel → output 0,1,E]
0 or 1 received correctly; if an erasure occurs, repeat until correct.
R = 1/T = 1 / (average time to transmit 1 correct bit)
  = 1 / { (1-e) + 2e(1-e) + 3e²(1-e) + … } = 1 - e
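The denominator is the mean of a geometric distribution; a quick check of the series (illustrative sketch, arbitrary e):

```python
e = 0.3  # erasure probability (arbitrary)
# T = sum_{k>=1} k * e^(k-1) * (1-e): expected number of transmissions until the first non-erasure
T = sum(k * (e ** (k - 1)) * (1 - e) for k in range(1, 200))
print(T, 1 / (1 - e))   # both ~1.4286, so R = 1/T = 1 - e
```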
Transmission efficiency
I need on the average H(X) bits/source output to describe the source symbols X
After observing Y, I need H(X|Y) bits/source output
[Figure: X → channel → Y; H(X) bits are needed before, H(X|Y) bits after observing Y]
Reduction in description length is called the transmitted information
Transmitted information: R = H(X) - H(X|Y) = H(Y) – H(Y|X) (from the earlier calculations)
We can maximize R by changing the input probabilities.
The maximum is called CAPACITY (Shannon 1948)
Transmission efficiency

Shannon shows that error correcting codes exist that have
- an efficiency k/n ≈ Capacity (n channel uses for k information symbols)
- a decoding error probability → 0 when n is very large
Problem: how to find these codes
In practice:
Transmit 0 or 1, receive 0 or 1:

Transmit   Receive   Result
0          0         correct
0          1         incorrect
1          1         correct
1          0         incorrect

What can we do about it?
Reliable: 2 examples
Example 1: transmit A := 0 0 or B := 1 1.
Receive 0 0 or 1 1: OK
Receive 0 1 or 1 0: NOK → 1 error detected!

Example 2: transmit A := 0 0 0 or B := 1 1 1.
Receive 000, 001, 010, 100 → A
Receive 111, 110, 101, 011 → B
1 error corrected!
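A minimal simulation of this 3-bit repetition code with majority decoding over a binary symmetric channel (illustrative sketch; the crossover probability 0.1 is an arbitrary choice):

```python
import random

random.seed(1)
p = 0.1          # crossover probability of the binary symmetric channel (arbitrary)
trials = 100_000
errors_uncoded = 0
errors_coded = 0

for _ in range(trials):
    bit = random.randint(0, 1)
    # Uncoded: a single transmission is wrong with probability p
    if random.random() < p:
        errors_uncoded += 1
    # Coded: repeat 3 times, decode by majority vote
    received = [bit ^ (random.random() < p) for _ in range(3)]
    decoded = 1 if sum(received) >= 2 else 0
    if decoded != bit:
        errors_coded += 1

print(errors_uncoded / trials)  # ~p = 0.1
print(errors_coded / trials)    # ~3p^2 - 2p^3 = 0.028: a single error is corrected
```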
Data processing (1)
Let X, Y and Z form a Markov chain: X → Y → Z,
where Z is independent of X given Y, i.e. P(x,y,z) = P(x) P(y|x) P(z|y)
[Figure: X → P(y|x) → Y → P(z|y) → Z]
Then: I(X;Y) ≥ I(X;Z)
Conclusion: processing destroys information
Data processing (2)
To show that: I(X;Y) ≥ I(X;Z)
Proof:
I(X;(Y,Z)) = H(Y,Z) - H(Y,Z|X)
           = H(Y) + H(Z|Y) - H(Y|X) - H(Z|YX)
           = I(X;Y) + I(X;Z|Y)
I(X;(Y,Z)) = H(X) - H(X|YZ)
           = H(X) - H(X|Z) + H(X|Z) - H(X|YZ)
           = I(X;Z) + I(X;Y|Z)
Now I(X;Z|Y) = 0 (independence), so I(X;Y) = I(X;Z) + I(X;Y|Z).
Thus: I(X;Y) ≥ I(X;Z)
I(X;Y) ≥ I(X;Z)?
The question is: H(X) – H(X|Y) ≥ H(X) – H(X|Z), i.e. H(X|Z) ≥ H(X|Y)?
Proof:
1) H(X|Z) - H(X|Y) ≥ H(X|ZY) - H(X|Y)   (extra conditioning cannot increase H, so H(X|Z) ≥ H(X|ZY))
2) From P(x,y,z) = P(x)P(y|x)P(z|xy) = P(x)P(y|x)P(z|y): H(X|ZY) = H(X|Y)
3) Thus H(X|Z) - H(X|Y) ≥ H(X|ZY) - H(X|Y) = 0
Fano inequality (1)
Suppose we have the following situation: Y is the observation of X
[Figure: X → p(y|x) → Y → decoder → X']
Y determines a unique estimate X':
correct with probability 1-P; incorrect with probability P.
Fano inequality (2)
Since Y uniquely determines X', we have H(X|Y) = H(X|(Y,X')) ≤ H(X|X')
X‘ differs from X with probability P
Thus, for L experiments, we can describe X given X' by:
firstly: describe the positions where X' ≠ X with Lh(P) bits
secondly:
- the positions where X' = X do not need extra bits
- for the LP positions we need ≤ log2(M-1) bits to specify X
Hence, normalized by L:
H(X|Y) ≤ H(X|X') ≤ h(P) + P log2(M-1)
Fano inequality (3)
H(X|Y) ≤ h(P) + P log2(M-1)
[Figure: the bound h(P) + P log2(M-1) plotted versus P; it increases from 0 at P = 0 to its maximum log2 M at P = (M-1)/M and equals log2(M-1) at P = 1]
Fano relates the conditional entropy to the detection error probability.
Practical importance: for a given channel with conditional entropy H(X|Y), the detection error probability has a lower bound: it cannot be better than this bound!
Fano inequality (3): example
X ∈ { 0, 1, 2, 3 }; P( X = 0, 1, 2, 3 ) = ( ¼, ¼, ¼, ¼ )
X can be observed as Y
Example 1: no observation of X.
Then H(X|Y) = H(X) = 2 = h(¾) + ¾ log2 3, so P = ¾.

Example 2: [Figure: channel from x ∈ {0,1,2,3} to y ∈ {0,1,2,3}; each input goes to three outputs with transition prob. = 1/3]
H(X|Y) = log2 3, and the Fano bound gives P > 0.4.

Example 3: [Figure: channel from x ∈ {0,1,2,3} to y ∈ {0,1,2,3}; each input goes to two outputs with transition prob. = 1/2]
H(X|Y) = log2 2, and the Fano bound gives P > 0.2.
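A quick numeric evaluation of the Fano bound h(P) + P log2(M-1) for these examples (illustrative sketch):

```python
import math

def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_bound(P, M):
    """Right-hand side of Fano's inequality: h(P) + P * log2(M - 1)."""
    return h(P) + P * math.log2(M - 1)

M = 4
print(fano_bound(0.75, M))   # 2.0 = H(X|Y) for Example 1 (no observation)
# Smallest P (searched on (0, (M-1)/M]) whose bound reaches H(X|Y) = log2(3) and log2(2):
for target in (math.log2(3), 1.0):
    P = min(p / 1000 for p in range(1, 751) if fano_bound(p / 1000, M) >= target)
    print(target, P)   # ~0.39 and ~0.19: roughly the P > 0.4 and P > 0.2 of Examples 2 and 3
```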
List decoding
Suppose that the decoder forms a list of size L.
PL is the probability of being in the list. Then
H(X|Y) ≤ h(PL) + PL log2 L + (1-PL) log2 (M-L)
The bound is not very tight, because of the log2 L term. Can you see why?
Fano ( http://www.youtube.com/watch?v=sjnmcKVnLi0 )
Shannon showed that it is possible to compress information. He produced examples
of such codes which are now known as Shannon-Fano codes.
Robert Fano was an electrical engineer at MIT (the son of G. Fano, the Italian
mathematician who pioneered the development of finite geometries and for whom
the Fano Plane is named).
Robert Fano
Application source coding: example MP3
Digital audio signals:
Without data reduction, 16-bit samples at a sampling rate of 44.1 kHz are used for Compact Discs; about 1.4 Mbit represent just one second of stereo music in CD quality.
With data reduction:
MPEG audio coding is realized by perceptual coding techniques addressing the perception of sound waves by the human ear. It maintains a sound quality that is significantly better than what you get by just reducing the sampling rate and the resolution of your samples.
Using MPEG audio, one may achieve a typical data reduction of
1:4 by Layer 1 (corresponds to 384 kbps for a stereo signal),
1:6...1:8 by Layer 2 (corresponds to 256...192 kbps for a stereo signal),
1:10...1:12 by Layer 3 (corresponds to 128...112 kbps for a stereo signal).