September 21, 2000
Language and Information
Handout #2
(C) 2000, The University of Michigan
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall
Readings
• Textbook:
– Oakes, Chapter 2, pages 53 – 76
• Additional readings
– M&S, Chapter 7 (minus Section 7.4)
– M&S, Chapter 8 (minus Sections 8.3-4)
Information Theory
Entropy
• Let p(x) be the probability mass function of a random variable X, over a discrete set of symbols (or alphabet) X:

  p(x) = P(X = x), x ∈ X

• Example: throwing two coins and counting heads and tails
• Entropy (self-information) is the average uncertainty of a single random variable:

  H(X) = - Σx p(x) log2 p(x)
Information theoretic measures
• Claude Shannon (information theory):
“information = unexpectedness”
• Series of events (messages) with associated
probabilities: pi (i = 1 .. n)
• Goal: to measure the information content,
H(p1, …, pn) of a particular message
• Simplest case: the messages are words
• When pi is low, the word is more unexpected and therefore more informative
Properties of information content
• H is a continuous function of the pi
• If all p are equal (pi = 1/n), then H is a
monotone increasing function of n
• If a message is broken into two successive
messages, the original H is a weighted sum
of the resulting values of H
Example
p1 = 1/2, p2 = 1/3, p3 = 1/6
• Only function satisfying all three properties
is the entropy function:
H = - Σi pi log2 pi
Example (cont’d)
H = - (1/2 log2 1/2 + 1/3 log2 1/3 + 1/6 log2 1/6)
  = 1/2 log2 2 + 1/3 log2 3 + 1/6 log2 6
  = 1/2 + 1.585/3 + 2.585/6
  = 1.46

Alternative formula for H:

H = Σi pi log2 (1/pi)
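A quick numerical check of this example (a minimal Python sketch; the function name entropy is just illustrative):

import math

def entropy(probs):
    # H = -sum of p log2 p, skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

# the distribution p1 = 1/2, p2 = 1/3, p3 = 1/6 from the example
print(round(entropy([1/2, 1/3, 1/6]), 2))   # 1.46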
Another example
• Example:
  – No tickets left: P = 1/2
  – Matinee shows only: P = 1/4
  – Eve. show, undesirable seats: P = 1/8
  – Eve. show, orchestra seats: P = 1/8
Example (cont’d)
H = - (1/2 log2 1/2 + 1/4 log2 1/4 + 1/8 log2 1/8 + 1/8 log2 1/8)
H = - ((1/2 x -1) + (1/4 x -2) + (1/8 x -3) + (1/8 x -3))
H = 1.75 (bits per symbol)
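For this distribution each -log2 P is a whole number of bits, so H is also the average length of an optimal code that spends -log2 P bits per message (a sketch of that observation; the message labels are shorthand for the four cases above):

import math

probs = {"no tickets": 1/2, "matinee only": 1/4,
         "eve., undesirable": 1/8, "eve., orchestra": 1/8}

for msg, p in probs.items():
    print(f"{msg:20s} P = {p:<6} -log2 P = {-math.log2(p):.0f} bits")

print("H =", sum(p * -math.log2(p) for p in probs.values()), "bits per symbol")   # 1.75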
Characteristics of Entropy
• When one of the messages has a probability
approaching 1, then entropy decreases.
• When all messages have the same
probability, entropy increases.
• Maximum entropy: when P = 1/n (H = ??)
• Relative entropy: ratio of actual entropy to
maximum entropy
• Redundancy: 1 - relative entropy
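The last three bullets can be made concrete with a short sketch (a minimal Python illustration reusing the ticket example; "relative entropy" here is the ratio defined on this slide, not KL divergence):

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/8]        # the ticket example
n = len(probs)
H = entropy(probs)
H_max = math.log2(n)                 # entropy of the uniform distribution P = 1/n
print("H =", H, "  H_max =", H_max)
print("relative entropy =", H / H_max)
print("redundancy =", 1 - H / H_max)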
Entropy examples
• Letter frequencies in Simplified Polynesian:
P(1/8), T(1/4), K(1/8), A(1/4), I (1/8), U (1/8)
• What is H(P)?
• What is the shortest code that can be designed to
describe simplified Polynesian?
• What is the entropy of a weighted coin? Draw a
diagram.
Joint entropy and conditional entropy
• The joint entropy of a pair of discrete random variables X, Y ~ p(x,y) is the amount of information needed on average to specify both their values:

  H(X,Y) = - Σx Σy p(x,y) log2 p(x,y)

• The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x,y), expresses how much extra information is needed to communicate Y given that the other party knows X:

  H(Y|X) = - Σx Σy p(x,y) log2 p(y|x)
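Both definitions translate directly into code; a minimal sketch with a small made-up joint distribution (the numbers and names are illustrative, not from the handout):

import math

# made-up joint distribution p(x, y)
p_xy = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

def joint_entropy(p_xy):
    return -sum(p * math.log2(p) for p in p_xy.values() if p > 0)

def conditional_entropy(p_xy):
    # H(Y|X) = - sum over x,y of p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x)
    p_x = {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0) + p
    return -sum(p * math.log2(p / p_x[x])
                for (x, y), p in p_xy.items() if p > 0)

print("H(X,Y) =", joint_entropy(p_xy))
print("H(Y|X) =", conditional_entropy(p_xy))

The chain rule on the next slide, H(X,Y) = H(X) + H(Y|X), can be checked numerically with these two functions.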
Connection between joint and
conditional entropies
• There is a chain rule for entropy (note that the
products in the chain rules for probabilities have
become sums because of the log):
H (X,Y) = H(X) + H(Y|X)
H (X1,…,Xn) = H(X1) + H(X2|X1) + … + H(Xn|X1,…,Xn-1)
Simplified Polynesian revisited
       p      t      k
a      1/16   3/8    1/16    1/2
i      1/16   3/16   0       1/4
u      0      3/16   1/16    1/4
       1/8    3/4    1/8
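Assuming (as in M&S) that the rows are the vowels, the columns are the consonants, and the entries are the per-letter joint probabilities, the entropies of the table can be computed directly (a sketch; the fractions are the ones shown above):

import math
from fractions import Fraction as F

# joint distribution p(consonant, vowel) from the table above
p = {('p','a'): F(1,16), ('t','a'): F(3,8),  ('k','a'): F(1,16),
     ('p','i'): F(1,16), ('t','i'): F(3,16), ('k','i'): F(0),
     ('p','u'): F(0),    ('t','u'): F(3,16), ('k','u'): F(1,16)}

def H(probs):
    return -sum(float(q) * math.log2(float(q)) for q in probs if q > 0)

p_c = {c: sum(q for (c2, v), q in p.items() if c2 == c) for c in 'ptk'}   # marginals
p_v = {v: sum(q for (c, v2), q in p.items() if v2 == v) for v in 'aiu'}

print("H(C)   =", H(p_c.values()))
print("H(V)   =", H(p_v.values()))                      # 1.5 bits
print("H(C,V) =", H(p.values()))
print("H(V|C) =", H(p.values()) - H(p_c.values()))      # via the chain rule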
Mutual information
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
H(X) – H(X|Y) = H(Y) – H(Y|X) = I(X;Y)
• Mutual information: reduction in
uncertainty of one random variable due to
knowing about another, or the amount of
information one random variable contains
about another.
Mutual information and entropy
[Diagram: H(X,Y) drawn as the union of H(X) and H(Y); H(X|Y) and H(Y|X) are the non-overlapping parts and I(X;Y) is the overlap]
• I(X;Y) is 0 iff two variables are independent
• For two dependent variables, mutual information grows
not only with the degree of dependence, but also according
to the entropy of the variables
Formulas for I(X;Y)
I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)

I(X;Y) = Σx Σy p(x,y) log2 [p(x,y) / (p(x)p(y))]

Since H(X|X) = 0, note that H(X) = H(X) - H(X|X) = I(X;X)

Pointwise mutual information: I(x;y) = log2 [p(x,y) / (p(x)p(y))]
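A small sketch of both quantities over a made-up joint distribution (the names and numbers are illustrative):

import math

p_xy = {(0, 0): 3/8, (0, 1): 1/8, (1, 0): 1/8, (1, 1): 3/8}   # made-up p(x,y)
p_x = {0: 1/2, 1: 1/2}                                         # its marginals
p_y = {0: 1/2, 1: 1/2}

def pmi(x, y):
    # pointwise mutual information I(x;y) = log2 p(x,y) / (p(x) p(y))
    return math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# mutual information I(X;Y) = expectation of the pointwise values under p(x,y)
I = sum(p * pmi(x, y) for (x, y), p in p_xy.items())
print("I(X;Y) =", I)
print("I(0;0) =", pmi(0, 0))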
The noisy channel model
W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ

• W: message from a finite alphabet
• X: input to the channel
• Y: output from the channel
• Ŵ: attempt to reconstruct the message based on the output

[Diagram: binary symmetric channel: each input bit (0 or 1) is transmitted correctly with probability 1-p and flipped with probability p]
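A minimal simulation of the binary symmetric channel in the diagram (a hedged sketch; the flip probability and the input bits are made up):

import random

random.seed(0)

def bsc(bits, p):
    # binary symmetric channel: each bit is flipped independently with probability p,
    # transmitted correctly with probability 1 - p
    return [b ^ 1 if random.random() < p else b for b in bits]

x = [0, 1, 1, 0, 1, 0, 0, 1] * 4      # input to the channel (X)
y = bsc(x, p=0.1)                      # output from the channel (Y)
print(sum(xi != yi for xi, yi in zip(x, y)), "of", len(x), "bits flipped")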
Statistical NLP as decoding problems

Application: input (i), output (o), p(i), p(o|i)

• Machine translation: L1 word sequences → L2 word sequences; p(i) = p(L1) in a language model; p(o|i) = translation model
• Optical character recognition: actual text → text with mistakes; p(i) = prob. of language text; p(o|i) = model of OCR errors
• Part-of-speech tagging: POS tag sequences → English words; p(i) = prob. of POS sequences; p(o|i) = p(w|t)
• Speech recognition: word sequences → speech signal; p(i) = prob. of word sequences; p(o|i) = acoustic model
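Each row of the table plugs into the same decoding recipe: choose the input i that maximizes p(i) * p(o|i). A toy sketch in the spirit of the OCR row (every probability below is made up purely for illustration):

# noisy channel decoding: i_hat = argmax over i of p(i) * p(o|i)
p_i = {"the": 0.6, "thee": 0.1, "then": 0.3}          # toy "language model" p(i)
p_o_given_i = {                                        # toy "channel model" p(o|i)
    ("teh", "the"): 0.05,
    ("teh", "thee"): 0.01,
    ("teh", "then"): 0.001,
}

def decode(o):
    return max(p_i, key=lambda i: p_i[i] * p_o_given_i.get((o, i), 0.0))

print(decode("teh"))   # "the": the best trade-off between the two models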
Coding
Compression
• Huffman coding (prefix property)
• Ziv-Lempel codes (better)
• Arithmetic codes (better for images - why?)
Huffman coding
• Developed by David Huffman (1952)
• Average of 5 bits per character
• Based on frequency distributions of
symbols
• Algorithm: iteratively build a tree of
symbols starting with the two least frequent
symbols
Symbol   Frequency
A        7
B        4
C        10
D        5
E        2
F        11
G        15
H        3
I        7
J        8
[Diagram: Huffman tree built from the frequencies above; branches are labeled 0 and 1, the leaves are the symbols a-j, and reading the labels from the root to a leaf gives the codes on the next slide]
Symbol   Code
A        0110
B        0010
C        000
D        0011
E        01110
F        010
G        10
H        01111
I        110
J        111
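The iterative algorithm from two slides back can be sketched with a priority queue; ties may be broken differently than in the handout, so the codes it prints can differ from the table above while still being optimal (Python's heapq is used here purely as an illustration):

import heapq

freqs = {'A': 7, 'B': 4, 'C': 10, 'D': 5, 'E': 2,
         'F': 11, 'G': 15, 'H': 3, 'I': 7, 'J': 8}

# each heap entry: (total frequency, tie-breaker, {symbol: code built so far})
heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(freqs.items())]
heapq.heapify(heap)
counter = len(heap)

while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)                      # the two least frequent subtrees
    f2, _, right = heapq.heappop(heap)
    merged = {s: '0' + c for s, c in left.items()}         # left branch gets a 0
    merged.update({s: '1' + c for s, c in right.items()})  # right branch gets a 1
    heapq.heappush(heap, (f1 + f2, counter, merged))
    counter += 1

codes = heap[0][2]
avg = sum(freqs[s] * len(c) for s, c in codes.items()) / sum(freqs.values())
print(codes)
print("average code length:", round(avg, 2), "bits per symbol")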
Exercise
• Consider the bit string:
  01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to
decode it.
• Try inserting, deleting, and switching some
bits at random locations and try decoding.
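A decoding sketch that can be used for this exercise; it assumes the code table from the previous slide and relies on the prefix property (no codeword is a prefix of another), so bits can be consumed greedily:

codes = {'A': '0110', 'B': '0010', 'C': '000', 'D': '0011', 'E': '01110',
         'F': '010', 'G': '10', 'H': '01111', 'I': '110', 'J': '111'}
decode_table = {c: s for s, c in codes.items()}

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in decode_table:         # prefix property: the first match is the codeword
            out.append(decode_table[buf])
            buf = ''
    if buf:                              # leftover bits (e.g. after corruption)
        out.append('<' + buf + '?>')
    return ''.join(out)

print(decode('011010111'))   # AGJ; paste the exercise's bit string to decode it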
Ziv-Lempel coding
• Two types - one is known as LZ77 (used in
GZIP)
• Code: set of triples <a,b,c>
• a: how far back in the decoded text to look
for the upcoming text segment
• b: how many characters to copy
• c: new character to add to complete segment
Triple      Decoded text so far
<0,0,p>     p
<0,0,e>     pe
<0,0,t>     pet
<2,1,r>     peter
<0,0,_>     peter_
<6,1,i>     peter_pi
<8,2,r>     peter_piper
<6,3,c>     peter_piper_pic
<0,0,k>     peter_piper_pick
<7,1,d>     peter_piper_picked
<7,1,a>     peter_piper_picked_a
<9,2,e>     peter_piper_picked_a_pe
<9,2,_>     peter_piper_picked_a_peck_
<0,0,o>     peter_piper_picked_a_peck_o
<0,0,f>     peter_piper_picked_a_peck_of
<17,5,l>    peter_piper_picked_a_peck_of_pickl
<12,1,d>    peter_piper_picked_a_peck_of_pickled
<16,3,p>    peter_piper_picked_a_peck_of_pickled_pep
<3,2,r>     peter_piper_picked_a_peck_of_pickled_pepper
<0,0,s>     peter_piper_picked_a_peck_of_pickled_peppers
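A decoder sketch for this triple format, assuming (as in the listing above) that a counts back from the end of the text decoded so far, b characters are copied, and c is appended:

triples = [(0,0,'p'), (0,0,'e'), (0,0,'t'), (2,1,'r'), (0,0,'_'), (6,1,'i'),
           (8,2,'r'), (6,3,'c'), (0,0,'k'), (7,1,'d'), (7,1,'a'), (9,2,'e'),
           (9,2,'_'), (0,0,'o'), (0,0,'f'), (17,5,'l'), (12,1,'d'), (16,3,'p'),
           (3,2,'r'), (0,0,'s')]

def lz77_decode(triples):
    text = ''
    for back, length, char in triples:
        if length:
            start = len(text) - back
            text += text[start:start + length]   # copy from earlier in the decoded text
        text += char                              # then append the new character
    return text

print(lz77_decode(triples))   # peter_piper_picked_a_peck_of_pickled_peppers

(In general LZ77 also allows copies that overlap the text being produced, which must then be copied character by character; the simple slice above is enough for this example.)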
Average decoded text length per code triple (for the example above):

No. of code triples   Average text length   No. of code triples   Average text length
1                     1.00                  11                    1.82
2                     1.00                  12                    1.92
3                     1.00                  13                    2.00
4                     1.25                  14                    1.93
5                     1.20                  15                    1.87
6                     1.33                  16                    2.13
7                     1.57                  17                    2.12
8                     1.88                  18                    2.22
9                     1.78                  19                    2.26
10                    1.80                  20                    2.20
Arithmetic coding
• Uses probabilities
• Achieves about 2.5 bits per character
Adaptive symbol probabilities and interval bounds while encoding "abacus":

Symbol        Initial   After a   After ab   After aba   After abac   After abacu   After abacus
a             1/5       2/6       2/7        3/8         3/9          3/10          3/11
b             1/5       1/6       2/7        2/8         2/9          2/10          2/11
c             1/5       1/6       1/7        1/8         2/9          2/10          2/11
s             1/5       1/6       1/7        1/8         1/9          1/10          2/11
u             1/5       1/6       1/7        1/8         1/9          2/10          2/11

Upper bound   1.000     0.200     0.1000     0.076190    0.073809     0.073809      0.073795
Lower bound   0.000     0.000     0.0666     0.066666    0.072619     0.073767      0.073781
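A sketch that tracks the narrowing [lower, upper) interval in this table, assuming an adaptive model in which every count starts at 1 and the symbols are ordered a, b, c, s, u inside the interval (rounding may make the last digits differ slightly from the table):

def arithmetic_bounds(message, alphabet):
    counts = {s: 1 for s in alphabet}        # adaptive model: every symbol starts at 1
    low, high = 0.0, 1.0
    for ch in message:
        total = sum(counts.values())
        width = high - low
        cum = 0
        for s in alphabet:                   # locate ch's sub-interval
            if s == ch:
                break
            cum += counts[s]
        low, high = (low + width * cum / total,
                     low + width * (cum + counts[ch]) / total)
        counts[ch] += 1                      # update the model after encoding ch
        print(f"after {ch}: [{low:.6f}, {high:.6f})")
    return low, high

arithmetic_bounds("abacus", "abcsu")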
Exercise
• Assuming the alphabet consists of a, b, and c, develop arithmetic encoding for the following strings:

  aaa    aab
  aba    baa
  abc    cab
  cba    bac