Chapter 20 Information Theory

20.1 Introduction
• Statistical thermodynamics provides the tools for
calculating entropy.
• Entropy is a measure of the degree of
randomness or disorder of a system.
• Disorder implies a lack of information regarding
the exact state of the system.
• A disordered system is one about which we lack
complete information.
20.2 Uncertainty and Information
• Claude Shannon laid down the foundations of
information theory.
• It was further developed by Leon Brillouin.
• It was applied to statistical thermodynamics by E. T.
Jaynes.
• The cornerstone of the Shannon theory is the
observation that information is a combination of
the certain and the uncertain, of the expected
and the unexpected.
• The degree of surprise generated by a
certain event – one that has already
occurred – is zero.
• If a less probable event is reported, the
information conveyed is greater.
• The information should increase as the
probability decreases (see the sketch below).
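• A minimal sketch in Python (not part of the original slides) of this point: measuring the
"surprise" of an event as -log2(p), a certain event conveys nothing, while rarer events
convey more.

    import math

    def surprise_bits(p):
        """Information conveyed by an event of probability p, in bits."""
        return -math.log2(p)

    # A certain event (p = 1) carries no surprise; rarer events carry more.
    for p in (1.0, 0.5, 0.1, 0.01):
        print(f"p = {p:5}: {surprise_bits(p):5.2f} bits")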
• For a given experiment, consider a set of possible
outcomes whose probabilities are p1, p2, … pn.
• It is possible to find a quantity H(p1 . . . pn) that measures
in a unique way the amount of uncertainty represented
by the given set of probabilities.
• Only three conditions are needed to specify the function
H(p1 . . . pn) to within a constant factor. They are:
1. H is a continuous function of the pi.
2. If all the pi’s are equal, pi = 1/n; then H(1/n,…, 1/n)
is a monotonic increasing function of n.
3. If the possible outcomes of a particular
experiment depend on the possible outcomes of n
subsidiary experiments, then H is the sum of the
uncertainties of the subsidiary experiments.
• The above discussion leads to g(R) + g(S) = g(RS), where R and S are the numbers of
equally likely outcomes of two independent experiments; one can expect the
function g( ) to be a logarithmic function.
• In general form, the function can be written as g(x) = A ln(x) + C,
where A and C are constants.
• From the earlier transformation g(x) = x f(1/x), one gets that the
uncertainty function shall satisfy (1/p)f(p) = A ln(p) + C, where p =
1/n (n is the total number of equally likely events).
• Therefore, f(p) = A*p*ln(p) + C*p
• Given that the uncertainty must be zero when the probability is 1, i.e. f(1) = 0, the
constant C must be zero.
• Thus, f(p) = A*p*ln(p).
• Since p is smaller than 1, ln(p) is negative; because the uncertainty must be
positive, the constant A is inherently negative.
• Following conventional notation, we write f(p) = -K*p*ln(p), where K is
a positive coefficient.
• The uncertainty quantity H(p1, p2, …pn) = Σ f(pi)
• Thus, H(p1, p2, …pn) = Σ –K*pi*ln(pi) = -K Σpi*ln(pi)
• Example (a numerical check is sketched below):
H(1/2, 1/3, 1/6) = -K*[1/2*ln(1/2) + 1/3*ln(1/3) + 1/6*ln(1/6)]
= -K*(-0.346 - 0.366 - 0.299) = 1.01K
From the decomposed procedure,
H(1/2, 1/2) + 1/2*H(2/3, 1/3) = -K*[1/2*ln(1/2) + 1/2*ln(1/2)]
- 1/2*K*[2/3*ln(2/3) + 1/3*ln(1/3)]
= -K*(-0.346 - 0.346) - K/2*(-0.270 - 0.366)
= 1.01K
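• A short numerical check of the example above, sketched in Python with K = 1; the agreement
between the direct and decomposed values illustrates condition 3.

    import math

    def H(probs, K=1.0):
        """Uncertainty H = -K * sum(p * ln p), skipping zero probabilities."""
        return -K * sum(p * math.log(p) for p in probs if p > 0)

    direct = H([1/2, 1/3, 1/6])
    decomposed = H([1/2, 1/2]) + 1/2 * H([2/3, 1/3])
    print(direct, decomposed)   # both approximately 1.011, i.e. 1.01K with K = 1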
• For equally probable events, pi = 1/n and H = K*ln(n)
• In a binary case, where an experiment has two possible outcomes
with probabilities p1 and p2 such that p1 + p2 = 1,
H = -K*[p1*ln(p1) + p2*ln(p2)]
• To determine the H value when p1 is 0 or 1, one needs L’Hopital’s rule:
lim[u(x)/v(x)] as x approaches 0 equals lim[u'(x)/v'(x)]
• Therefore, as p1 approaches 0,
lim[p1*ln(p1)] = lim[ln(p1)/(1/p1)] = lim[(1/p1)/(-1/p1^2)] = lim(-p1) = 0
• The uncertainty is therefore 0 when either p1 or p2 is zero!
• At what value of p1 does H reach its maximum? Differentiate
H = -K*[p1*ln(p1) + p2*ln(p2)] with respect to p1 and set the
derivative equal to 0 (see the sketch below):
dH/dp1 = -K*[ln(p1) + p1/p1 - ln(1 - p1) - (1 - p1)/(1 - p1)] = -K*ln[p1/(1 - p1)] = 0
which leads to p1 = 1/2
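• A quick numerical confirmation, sketched in Python with K = 1, that the binary uncertainty
H(p1, 1 - p1) vanishes at the endpoints and peaks at p1 = 1/2 with the value K*ln(2).

    import math

    def H_binary(p, K=1.0):
        """H = -K*[p*ln(p) + (1-p)*ln(1-p)], with 0*ln(0) taken as 0."""
        return -K * sum(q * math.log(q) for q in (p, 1 - p) if q > 0)

    grid = [i / 1000 for i in range(1001)]      # scan p1 from 0 to 1
    p_max = max(grid, key=H_binary)
    print(p_max, H_binary(p_max))               # 0.5 and ~0.693 (= ln 2)
    print(H_binary(0.0), H_binary(1.0))         # both 0: no uncertainty at the endpoints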
20.3 Unit of Information
• Choosing 2 as the base of the logarithm
and taking K = 1, an equiprobable binary event gives H = log2(2) = 1.
• We call this unit of information a bit.
• For a decimal digit, H = log2(10) = 3.32; thus a
decimal digit contains about 3 and 1/3 bits
of information (see the sketch below).
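• A small illustration in Python: taking K = 1/ln(2), which is the same as using base-2
logarithms, gives H in bits.

    import math

    K = 1 / math.log(2)          # this choice of K measures H in bits

    def H_equal(n, K=K):
        """Uncertainty of n equally likely outcomes: H = K*ln(n)."""
        return K * math.log(n)

    print(H_equal(2))            # 1 bit (up to rounding) for a binary event
    print(H_equal(10))           # ~3.32 bits per decimal digit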
Linguistics
• A more refined analysis works in terms of component syllables. One can
test what is significant in a syllable in speech by swapping syllables and
seeing if meaning or tense is changed or lost. The table gives some
examples of the application of this statistical approach to some works of
literature.
• The types of interesting results that arise from such
studies include:
(a) English has the lowest entropy of any major
language, and
(b) Shakespeare’s work has the lowest entropy of any
author studied.
• These ideas are now progressing beyond the scientific
level and are impinging on new ideas of criticism. Here,
as in biology, the thermodynamic notions can be helpful,
though they must be applied with caution because
concepts such as ‘quality’ cannot be measured, as they
are purely subjective.
Maximum entropy
• The amount of uncertainty:
H(p1, ..., pn) = -K Σ (i = 1 to n) pi ln(pi)
• Examples of the connection between
entropy and uncertainty (gases in a
partitioned container)
• The determination of the probability distribution that
has maximum entropy.
• Suppose one knows the mean value of some particular variable x:
<x> = Σ pi xi
• where the unknown probabilities satisfy the condition:
Σ (i = 1 to n) pi = 1
• In general there will be a large number of probability distributions
consistent with the above information.
• We will determine the one distribution which yields the largest
uncertainty (i.e. the maximum-entropy distribution).
• We use the method of Lagrange multipliers to carry out the analysis:
∂/∂pj [H - λφ - μψ] = 0
• where
φ = Σ pi - 1
• and
ψ = Σ pi xi - <x>
• Then,
∂/∂pj [-K Σ pi ln(pi) - λ Σ pi - μ Σ pi xi] = 0
-K ln(pj) - K - λ - μ xj = 0
• Solving for ln(pi), with the constants absorbed into redefined multipliers:
ln(pi) = -λ - μ xi
pi = e^(-λ - μ xi)
Determine the new Lagrange
multipliers λ and μ
• So that the normalization condition Σ pi = 1 is satisfied:
e^(-λ) Σ e^(-μ xi) = 1, so e^(λ) = Σ e^(-μ xi)
• We define the partition function
Z = Σ e^(-μ xi)
• Then
pi = e^(-μ xi) / Z
λ = ln(Z)
Determine multiplier μ
• Differentiating the partition function with respect to μ:
∂Z/∂μ = -Σ xi e^(-μ xi)
• We have
<x> = Σ pi xi = (1/Z) Σ xi e^(-μ xi) = -(1/Z) ∂Z/∂μ = -∂(ln Z)/∂μ
• Therefore,
H_max = -K Σ pi ln(pi) = K Σ pi (λ + μ xi) = K [ln(Z) + μ<x>]
(a numerical check is sketched below)
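• The derivation above can be checked numerically. The following Python sketch (the values
xs and the target mean <x> are made-up illustrative numbers, and the bisection search for μ
is just one simple way to satisfy the mean constraint) finds the maximum-entropy distribution
pi = e^(-μ xi)/Z and confirms H_max = K[ln(Z) + μ<x>] with K = 1.

    import math

    xs = [1.0, 2.0, 3.0, 4.0]        # possible values of x (illustrative)
    x_mean = 1.8                     # prescribed mean <x> (illustrative)

    def distribution(mu):
        """Maximum-entropy form p_i = exp(-mu*x_i) / Z."""
        weights = [math.exp(-mu * x) for x in xs]
        Z = sum(weights)
        return [w / Z for w in weights], Z

    def mean(mu):
        probs, _ = distribution(mu)
        return sum(p * x for p, x in zip(probs, xs))

    # Bisection on mu: the mean decreases monotonically as mu increases.
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean(mid) > x_mean:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)

    probs, Z = distribution(mu)
    H_max = -sum(p * math.log(p) for p in probs)    # K = 1
    print(mean(mu))                                 # recovers <x> = 1.8
    print(H_max, math.log(Z) + mu * x_mean)         # the two values agree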
20.5 The connection to statistical
thermodynamics
• The entropy is defined as S = k ln(W), with W = N! / Π Nj! the number of microstates, so
S/k = ln(N!) - Σ ln(Nj!)
• Applying Stirling's approximation, ln(N!) ≈ N ln(N) - N:
S/k ≈ N ln(N) - N - Σ [Nj ln(Nj) - Nj]
= N ln(N) - Σ Nj ln(Nj)
= -Σ Nj [ln(Nj) - ln(N)]
= -Σ Nj ln(Nj/N)
• Then
S = -kN Σ (j = 1 to n) (Nj/N) ln(Nj/N)
• A disordered system may be found in any one of a large number of different
quantum states. If Nj = 1 for N different states and Nj = 0 for all other
available states,
S = -kN Σ (j = 1 to N) (1/N) ln(1/N) = kN ln(N)
• The above function is positive and increases with increasing N.
• Associating Nj/N with the probability pj,
S = -kN Σ pj ln(pj) = (kN/K) H
(a numerical check of the Stirling-approximation step is sketched below)
• The expected amount of information we would gain is a measure of
our lack of knowledge of the state of the system.
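• A small numerical check of the Stirling step, sketched in Python with k = 1 and made-up
occupation numbers Nj: the approximate form -Σ Nj ln(Nj/N) tracks the exact
ln(N!/Π Nj!), and the agreement improves as N grows.

    import math

    Nj = [40, 30, 20, 10]        # illustrative occupation numbers
    N = sum(Nj)

    # Exact value: S/k = ln(N!) - sum of ln(Nj!)   (math.lgamma(n + 1) = ln(n!))
    exact = math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in Nj)

    # Stirling form: S/k = -N * sum (Nj/N) ln(Nj/N) = -sum Nj ln(Nj/N)
    approx = -sum(n * math.log(n / N) for n in Nj)

    print(exact, approx)         # close, and closer still for larger N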
• Negative entropy (negentropy)
• The Boltzmann distribution for non-degenerate energy states:
Nj/N = e^(-εj/kT) / Z
• where
Z = Σ e^(-εj/kT)
• Substituting into S = -kN Σ (Nj/N) ln(Nj/N):
S_max = (N/(TZ)) Σ εj e^(-εj/kT) + Nk ln(Z)
• The internal energy is
U = Σ Nj εj = N Σ εj (Nj/N) = (N/Z) Σ εj e^(-εj/kT)
• Therefore,
S_max = U/T + Nk ln(Z)
(verified numerically in the sketch below)
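• A numerical sketch in Python (the energy levels, temperature, and particle number are
made-up, with k = 1) confirming that S_max evaluated from -Nk Σ (Nj/N) ln(Nj/N) equals
U/T + Nk ln(Z).

    import math

    k = 1.0                          # Boltzmann constant (units chosen so k = 1)
    T = 2.0                          # illustrative temperature
    N = 1000                         # illustrative number of particles
    eps = [0.0, 1.0, 2.5, 4.0]       # illustrative non-degenerate energy levels

    Z = sum(math.exp(-e / (k * T)) for e in eps)
    p = [math.exp(-e / (k * T)) / Z for e in eps]    # Nj/N for each level

    U = N * sum(pj * ej for pj, ej in zip(p, eps))   # internal energy
    S_direct = -N * k * sum(pj * math.log(pj) for pj in p)
    S_formula = U / T + N * k * math.log(Z)

    print(S_direct, S_formula)       # identical up to rounding error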
Summary
• Information theory is an extension of thermodynamics and
probability theory. Much of the subject is associated with the names
of Brillouin and Shannon. It was originally concerned with passing
messages on telecommunication systems and with assessing the
efficiency of codes. Today it is applied to a wide range of problems,
ranging from the analysis of language to the design of computers.
• In this theory the word ‘information’ is used in a special sense. Suppose
that we are initially faced with a problem about which we have
no ‘information’ and that there are P possible answers. When we are
given some ‘information’ this has the effect of reducing the number
of possible answers, and if we are given enough ‘information’ we
may get to a unique answer. The effect of increased information is
thus to reduce the uncertainty about a situation. In a sense,
therefore, information is the antithesis of entropy, since entropy is a
measure of the randomness or disorder of a system. This contrast
led to the coining of the word negentropy to describe information.
• The basic unit of information theory is the bit—a shortened form of
‘binary digit’.
• For example, if one is given a playing card face
down without any information, it could be any
one of 52; if one is then told that it is an ace, it
could be any one of 4; if told that it is also a
spade, one knows for certain which card one
has. As we are given more information, the
situation becomes more certain. In general, to
determine which of the P equally likely possible outcomes is
realized, the required information is
I = K ln(P)
(a worked version in bits is sketched below)
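• A tiny worked version of the card example in Python, with information measured in bits
(i.e. K = 1/ln 2, so I = log2(P)).

    import math

    def info_bits(P):
        """Information needed to single out one of P equally likely possibilities."""
        return math.log2(P)

    print(info_bits(52))         # ~5.70 bits to identify an unknown card
    print(info_bits(52 / 4))     # ~3.70 bits gained on learning it is an ace
    print(info_bits(4))          # 2 bits gained on then learning it is a spade
    # 3.70 + 2.00 = 5.70: the pieces of information add up to the total required.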