Introduction to Language Modeling
Alex Acero
Acknowledgments
• Thanks to Joshua Goodman and Scott MacKenzie for many of the slides
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Probability – definition
P(X) means probability that X is true
P(baby is a boy) ≈ 0.5 (% of all babies that are boys)
P(baby is named John) ≈ 0.001 (% of all babies named John)
(Venn diagram: John ⊂ baby boys ⊂ babies)
Joint probabilities
• P(X, Y) means probability that X and Y are both true, e.g.
P(brown eyes, boy)
(Venn diagram: babies, baby boys, John, and brown eyes shown as overlapping sets)
Conditional probabilities
• P(X|Y) means the probability that X is true when we already know Y is true
– P(baby is named John | baby is a boy) ≈ 0.002
– P(baby is a boy | baby is named John) ≈ 1
Bayes rule
P(X|Y) = P(X, Y) / P(Y)
P(baby is named John | baby is a boy)
=P(baby is named John, baby is a boy) / P(baby is a boy)
= 0.001 / 0.5 = 0.002
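As a tiny sketch of these definitions in code, the counts below are made up to match the 0.5 / 0.001 / 0.002 figures on the slides (they are not real census data):

```python
total_babies = 1_000_000
baby_boys = 500_000          # so P(boy) = 0.5
boys_named_john = 1_000      # so P(named John, boy) = 0.001

p_boy = baby_boys / total_babies
p_john_and_boy = boys_named_john / total_babies

# Conditional probability via P(X | Y) = P(X, Y) / P(Y)
p_john_given_boy = p_john_and_boy / p_boy

print(p_boy)             # 0.5
print(p_john_and_boy)    # 0.001
print(p_john_given_boy)  # 0.002
```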
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
90% Removed
(The opening passage of "White Fang" with 90% of its characters removed: only isolated letters survive, and the text is unreadable.)
80% Removed
(The same passage with 80% of its characters removed: a few short fragments appear, but the text is still unreadable.)
70% Removed
(The same passage with 70% of its characters removed: word fragments such as "frown d on" and "laugh" emerge, but the passage is still very hard to read.)
60% Removed
(The same passage with 60% of its characters removed: whole phrases start to become guessable, though most of the passage is still unclear.)
50% Removed
(The same passage with 50% of its characters removed: with some effort, much of it can now be reconstructed.)
40% Removed
(The same passage with 40% of its characters removed: most words are recognizable and the passage can largely be read.)
30% Removed
(The same passage with 30% of its characters removed: the passage is readable with little effort.)
20% Removed
Dark spruce forest frowned n either side the roze
ate wa . The
trees had ee stripp d by
ecent wi d f thei white coverin of
frost
d they eemed to lean towa ds e ch o h r, bl ck and
mino s, in the ad ng l ght. A vas sil n e reigned over the
land. The land i sel was
e olatio , ifeless, wit out
movem n , s lon
n cold ha
e spi it of
s n t eve
hat
adn s. The e was
hi
i it of a ght , but f a laughter
ore
ible t an any s ne s - a laughter t a was mi hless as
he mile of he sphinx, a aug ter col as the f ost nd arta in
of th grimn ss of i fall bility
It as the asterful and
inc mmunicabl wisdo of e ernity
ugh ng at the futilit of li
and t e
for of ife. It w
the Wild, the avage, rozenhea t d N rthla d W ld.
10% Removed
Dark s ru e forest frowned on either side the frozen waterw y. The
trees had bee stripped by a recent ind of t eir w ite covering of
rost, and they seemed o lean towards each ot er, black and
ominous, in the fading li h . A vast silence reigned ver the
land
The land itself w s a deso ation, lifel ss, without
ovement, s l n and cold hat the spiri of it wa not even that
of sa ness.
here was a hint in it of laughte , but of a l ug ter
more terrible than any sadness - a
ughte that as mirthless s
he smile of the sphinx a laug ter cold as the rost a d p rt k ng
of the grimness of infal i ility. It as the masterful and
incommunica le wisdom of eternity laughing at th futility of life
an the effort f life. It was
e Wild, the sa ag , froz nearte Northland Wild.
0% Removed
Dark spruce forest frowned on either side the frozen waterway. The
trees had been stripped by a recent wind of their white covering of
frost, and they seemed to lean towards each other, black and
ominous, in the fading light. A vast silence reigned over the
land. The land itself was a desolation, lifeless, without
movement, so lone and cold that the spirit of it was not even that
of sadness. There was a hint in it of laughter, but of a laughter
more terrible than any sadness - a laughter that was mirthless as
the smile of the sphinx, a laughter cold as the frost and partaking
of the grimness of infallibility. It was the masterful and
incommunicable wisdom of eternity laughing at the futility of life
and the effort of life. It was the Wild, the savage, frozenhearted Northland Wild.
From Jack London’s “White Fang”
Language as Information
• Information is commonly measured in "bits"
• Since language is highly redundant, perhaps it can be viewed somewhat like information
• Can language be measured or coded in "bits"?
• Sure. Examples include…
– ASCII (7 bits per "symbol")
– Unicode (16 bits per symbol)
– But coding schemes, such as ASCII or Unicode, do not account for the redundancy in the language
• Two questions:
1. Can a coding system be developed for a language (e.g., English) that accounts for the redundancy in the language?
2. If so, how many bits per symbol are required?
How many bits?
• ASCII codes have seven bits
• 2^7 = 128 codes
• Codes include…
– 33 control codes
– 95 symbols, including 26 uppercase letters, 26 lowercase letters, space, and 42 "other" symbols
• In general, if we have n symbols, the number of bits needed to encode them is log2 n (note: log2 128 = 7)
• What about bare-bones English – 26 letters plus space?
• How many bits?
How many bits? (2)
• It takes log2 27 ≈ 4.75 bits/character to encode bare-bones English
• But, what about redundancy in English?
• Since English is highly redundant, is there a way to encode
letters in fewer bits?
– Yes
• How many bits?
– The answer (drum roll please)…
How many bits? (3)
• The minimum number of bits to encode English is
(approximately)…
– 1 bit/character
• How is this possible?
– E.g., Huffman coding
– ngrams
• More importantly, how is this answer computed?
• Want to learn how? Read…
Shannon, C. E. (1951). Prediction and entropy
of printed English. The Bell System Technical
Journal, 30, 51-64.
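As a rough sketch of how a bits-per-character estimate can be computed with character n-grams (one of the two ideas mentioned above), the function below trains and scores a maximum-likelihood character trigram model on the same text, so it underestimates the true entropy; Shannon's 1951 paper describes the proper prediction experiment. The sample string is a placeholder for a real document.

```python
import math
from collections import Counter

def bits_per_char(text, order=3):
    """Average -log2 P(next char | previous order-1 chars), i.e. an estimate
    of bits per character under a character n-gram model."""
    ngrams = Counter(text[i:i + order] for i in range(len(text) - order + 1))
    hists = Counter(text[i:i + order - 1] for i in range(len(text) - order + 2))

    total_bits, n = 0.0, 0
    for i in range(order - 1, len(text)):
        gram = text[i - order + 1:i + 1]
        p = ngrams[gram] / hists[gram[:-1]]   # maximum-likelihood estimate
        total_bits += -math.log2(p)
        n += 1
    return total_bits / n

sample = "the quick brown fox jumps over the lazy dog " * 100
print(bits_per_char(sample, order=3))
```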
Disambiguation
• A special case of prediction is disambiguation
• Consider the telephone keypad…
• Is it possible to enter text using this keypad?
• Yes. But the keys are ambiguous.
Ambiguity Continuum
53 keys → 27 keys → 8 keys → 1 key
(left to right: less ambiguity to more ambiguity)
Coping With Ambiguity
• There are two approaches to disambiguating the telephone
keypad
– Explicit
• Use additional keys or keystrokes to select the
desired letter
• E.g., multitap
– Implicit
• Add “intelligence” (i.e., a language model) to the
interface to guess the intended letter
• E.g., T9, Letterwise
Multitap
• Press a key once for the 1st letter, twice for the 2nd letter, and
so on
• Example...
84433.778844422255.22777666966.33366699.
→ "the quick brown fox"
58867N7777.66688833777.84433.55529999N999.36664.
→ "jumps over the lazy dog"
But, there is a problem. When consecutive letters are on the same key, additional disambiguation is needed. Two techniques: (i) timeout, (ii) a special "next letter" (N) key¹ to explicitly segment the letters. (See above: "ps" in "jumps" and "zy" in "lazy".)
¹ Nokia phones: timeout = 1.5 seconds, "next letter" key = down-arrow
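The multitap scheme above is easy to sketch in code. This is a minimal version: "N" stands in for the "next letter" key and the keypad layout is the standard 2–9 letter grouping; a real phone would also accept a timeout instead of the explicit key.

```python
KEYS = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
        '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
LETTER_TO_TAPS = {c: d * (i + 1)
                  for d, letters in KEYS.items()
                  for i, c in enumerate(letters)}

def multitap(word):
    out = []
    for c in word:
        taps = LETTER_TO_TAPS[c]
        # A letter on the same key as the previous one needs a segmentation
        # keypress (or a timeout) first.
        if out and out[-1][-1] == taps[0]:
            out.append('N')
        out.append(taps)
    return ''.join(out)

print(multitap("the"))    # 84433
print(multitap("jumps"))  # 58867N7777
print(multitap("lazy"))   # 55529999N999
```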
T9
• Product of Tegic Communications (www.tegic.com), a
subsidiary of Nuance Communications
• Licensed to many mobile phone companies
• The idea is simple:
– one key = one character
• A language model works “behind the scenes” to disambiguate
• Example (next slide)...
Guess the Word
C O M P U T E R
Number of word stems to consider: 3 × 3 × 3 × 4 × 3 × 3 × 3 × 4 = 11,664
“Quick Brown Fox” Using T9
843.78425.27696.369.58677.6837.843.5299.364.
the quick brown fox jumps over the lazy dog
But, there is a problem. The key sequences are ambiguous, and other words may exist for some sequences. See below.
843.78425.27696.369.58677.6837.843.5299.364.
the quick brown fox jumps over the jazz dog
Alternatives for each key sequence, in decreasing probability:
843 → the, tie, vie; 78425 → quick, stick; 27696 → brown, crown; 58677 → jumps, lumps; 6837 → over, muds; 5299 → lazy, jazz; 364 → dog, fog
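A sketch of the implicit approach: map each word to its key sequence, then return the candidates for a sequence ranked by frequency. The tiny word list and counts below are hypothetical stand-ins for T9's real dictionary and language model.

```python
from collections import defaultdict

KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}
CHAR_TO_KEY = {c: k for k, chars in KEYPAD.items() for c in chars}

def to_keys(word):
    return ''.join(CHAR_TO_KEY[c] for c in word)

# Hypothetical frequencies standing in for a real dictionary + language model.
WORDS = {'the': 5000, 'tie': 40, 'vie': 2,
         'quick': 100, 'stick': 90,
         'lazy': 60, 'jazz': 30,
         'dog': 200, 'fog': 80}

index = defaultdict(list)
for w, freq in WORDS.items():
    index[to_keys(w)].append((freq, w))

def t9(keys):
    """Candidate words for a key sequence, most probable first."""
    return [w for _, w in sorted(index[keys], reverse=True)]

print(t9('843'))    # ['the', 'tie', 'vie']
print(t9('5299'))   # ['lazy', 'jazz']
```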
Keystrokes Per Character (KSPC)
• Earlier examples used the “quick brown fox” phrase, which
has 44 characters (including one character after each word)
• Multitap and T9 require very different keystroke sequences
• Compare…
Method     Number of Keystrokes   Number of Characters   Keystrokes per Character
Qwerty     44                     44                     1.000
Multitap   88                     44                     2.000
T9         45                     44                     1.023

Phrase: the quick brown fox jumps over the lazy dog.
KSPC = keystrokes / characters
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Speech Recognition
(Block diagram of a spoken dialog system: ASR → SLU → DM → SLG → TTS. Inside the ASR, input speech goes through feature extraction and pattern classification (decoding, search), drawing on an acoustic model, a word lexicon, and a language model, followed by confidence scoring; the output is e.g. "Hello World" with confidences (0.9) (0.8).)
Language Modeling in ASR
Words* = argmax_Words P(Words | Acoustics)
       = argmax_Words P(Acoustics | Words) × P(Words)
• Some sequences of words sound alike, but not all of them are good English sentences.
– I went to a party
– Eye went two a bar tea
Rudolph the red nose reindeer.
Rudolph the Red knows rain, dear.
Rudolph the Red Nose reigned here.
Language Modeling in ASR
• This lets the recognizer make the right guess when two different
sentences sound the same.
For example:
• It’s fun to recognize speech?
• It’s fun to wreck a nice beach?
Humans have a Language Model
The ultimate goal is for a speech recognizer to perform as well as a human being.
A lot of research on this has been done in psychology:
• The *eel was on the shoe
• The *eel was on the car
People are able to adjust to the right context, which
• removes ambiguities
• limits the possible words
Very good language models already exist for dedicated applications (e.g., medical dictation, where there is a lot of standardization).
A bad language model
What’s a Language Model
• A language model is a probability distribution over word sequences
• P("And nothing but the truth") ≈ 0.001
• P("And nuts sing on the roof") ≈ 0
What’s a language model for?
• Speech recognition
• Machine translation
• Handwriting recognition
• Spelling correction
• Optical character recognition
• Typing in Chinese or Japanese
• (and anyone doing statistical modeling)
How Language Models work
Hard to compute P(“And nothing but the truth”)
Step 1: Decompose probability
P("And nothing but the truth")
= P("And") × P("nothing" | "And") × P("but" | "And nothing")
× P("the" | "And nothing but") × P("truth" | "And nothing but the")
Step 2: Approximate with trigrams
P("And nothing but the truth")
≈ P("And") × P("nothing" | "And") × P("but" | "And nothing")
× P("the" | "nothing but") × P("truth" | "but the")
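As a small illustration of the two steps above, the sketch below applies the chain rule with the trigram approximation. The toy conditional probabilities are made-up stand-ins for a trained, smoothed model.

```python
import math

def sentence_prob(words, cond_prob):
    """P(w1 ... wn) via the chain rule, with the trigram approximation:
    each word is conditioned on at most its two predecessors."""
    p = 1.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - 2):i])
        p *= cond_prob(w, context)
    return p

# Toy conditional probabilities standing in for a trained model (assumption).
toy = {
    ("and", ()): 0.01,
    ("nothing", ("and",)): 0.002,
    ("but", ("and", "nothing")): 0.2,
    ("the", ("nothing", "but")): 0.5,
    ("truth", ("but", "the")): 0.1,
}
print(sentence_prob("and nothing but the truth".split(),
                    lambda w, c: toy.get((w, c), 1e-6)))
```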
Example: How do we find probabilities?
Get real text, and start counting!
P("the" | "nothing but") ≈ C("nothing but the") / C("nothing but")

Training set:
"John read her book"
"I read a different book"
"John read a book by Mulan"

P(John | <s>) = C(<s>, John) / C(<s>) = 2/3
P(read | John) = C(John, read) / C(John) = 2/2
P(a | read) = C(read, a) / C(read) = 2/3
P(book | a) = C(a, book) / C(a) = 1/2
P(</s> | book) = C(book, </s>) / C(book) = 2/3
P ( a | read) 
Example (continued)
These bigram probabilities help us estimate the probability of the sentence as:
P(John read a book)
= P(John | <s>) P(read | John) P(a | read) P(book | a) P(</s> | book)
≈ 0.148
Then the cross-entropy is -(1/4) log2(0.148) ≈ 0.689,
so the perplexity is 2^0.689 ≈ 1.61.
Comparison: Wall Street Journal text (5,000 words) has a bigram perplexity of 128.
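The worked example above can be checked with a few lines of Python. This is a minimal sketch that recounts the bigrams from the three training sentences and reproduces the 0.148 probability and the 1.61 perplexity; it uses unsmoothed maximum-likelihood estimates, so it only works because every bigram in the test sentence was seen in training.

```python
import math
from collections import Counter

train = ["John read her book",
         "I read a different book",
         "John read a book by Mulan"]

unigrams, bigrams = Counter(), Counter()
for sent in train:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks[:-1])                      # history counts
    bigrams.update(zip(toks[:-1], toks[1:]))

def p(w, prev):                                     # maximum-likelihood bigram
    return bigrams[(prev, w)] / unigrams[prev]

toks = ["<s>"] + "John read a book".split() + ["</s>"]
prob = math.prod(p(w, prev) for prev, w in zip(toks[:-1], toks[1:]))
print(round(prob, 3))                               # 0.148

cross_entropy = -math.log2(prob) / 4                # 4 words, as on the slide
print(round(cross_entropy, 3))                      # 0.689
print(round(2 ** cross_entropy, 2))                 # 1.61
```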
Trigram example
To calculate this probability, we need to compute both the number of times "am" is preceded by "I" and the number of times "here" is preceded by "I am."
All four sound the same; the right decision can only be made by the language model.
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Evaluation
• How can you tell a good language model from a bad one?
• Run a machine translation system or a speech recognizer (or your application of choice) and calculate the word error rate
– Slow
– Specific to your system
Evaluation: Perplexity Intuition
• Ask a speech recognizer to recognize digits: "0, 1, 2, 3, 4, 5, 6, 7, 8, 9" – easy – perplexity 10
• Ask a speech recognizer to recognize names at Microsoft – hard – 30,000 names – perplexity 30,000
• Ask a speech recognizer to recognize "Operator" (1 in 4), "Technical support" (1 in 4), "Sales" (1 in 4), or one of 30,000 names (1 in 120,000 each) – perplexity 54
• Perplexity is a weighted equivalent branching factor.
Evaluation: perplexity
• "A, B, C, D, E, F, G, …, Z": perplexity is 26
• "Alpha, bravo, charlie, delta, …, yankee, zulu": perplexity is 26
• Perplexity measures language-model difficulty, not acoustic difficulty
• High perplexity means that the number of words branching from a previous word is larger, on average.
• Low perplexity does not guarantee good performance.
• For example, "B, C, D, E, G, P, T" has perplexity 7, but that does not take acoustic confusability into account.
Perplexity: Math
• Perplexity is the geometric-average inverse probability:
Perplexity = [ Π_{i=1..n} P(w_i | w_{1:i-1}) ]^(-1/n)
• Imagine "Operator" (1 in 4), "Technical support" (1 in 4), "Sales" (1 in 4), 30,000 names (1 in 120,000 each)
• A model that thinks all outcomes are equally likely assigns each probability 1/30,003
• Its average inverse probability is 30,003
Perplexity: Math
• Imagine “Operator” (1 in 4), “Technical
support” (1 in 4), “sales” (1 in 4), 30,000
names (1 in 120,000)
• Correct model gives these probabilities
• ¾ of time assigns probability ¼, ¼ of time
assigns probability 1/120,000
• Perplexity is 54 (compare to 30,003 for simple
model)
• Remarkable fact: the true model for data has
the lowest possible perplexity
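A quick numerical check of the two Perplexity: Math slides, written as a sketch. The geometric-average inverse probability of the four outcome types comes out near the 54 quoted above (the slide may round or weight slightly differently), while the uniform 1/30,003 model gets 30,003.

```python
import math

def perplexity(probs):
    """Geometric-average inverse probability of the test outcomes."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Test outcomes as on the slide: 3/4 of the time one of the three phrases
# (probability 1/4 each under the true model), 1/4 of the time a rare name
# (probability 1/120,000). Listing them 3:1 gives the right weighting.
true_model = [1/4, 1/4, 1/4, 1/120_000]
uniform_model = [1/30_003] * 4

print(round(perplexity(true_model), 1))    # ~52.7, roughly the 54 above
print(round(perplexity(uniform_model), 1)) # 30003.0
```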
Perplexity: Is lower better?
• Remarkable fact: the true model for the data has the lowest possible perplexity
• The lower the perplexity, the closer we are to the true model.
• Typically, perplexity correlates well with speech
recognition word error rate
– Correlates better when both models are trained on
same data
– Doesn’t correlate well when training data changes
Perplexity: The Shannon Game
• Ask people to guess the next letter, given context.
Compute perplexity.
Char n-gram   Low char   Upper char   Low word    Upper word
1             9.1        16.3         191,237     4,702,511
5             3.2        6.5          653         29,532
10            2.0        4.3          45          2,998
15            2.3        4.3          97          2,998
100           1.5        2.5          10          142

– (when we get to entropy, the "100" row corresponds to the "1 bit per character" estimate)
Homework
• Write a program to estimate the Entropy of written text
(Shannon, 1950)
• Input: a text document (you pick it, the larger the better)
• Write a program that predicts the next letter given the past
letters on a different text (make it interactive?)
– Hint: use character ngrams
• Check it does a perfect job on the training text
• Due 5/21
Evaluation: entropy
• Entropy = log2(perplexity) = -(1/n) log2 Π_{i=1..n} P(w_i | w_{1:i-1})
• Should really be called "cross-entropy of the model on test data."
• Remarkable fact: entropy is the average number of bits per word required to encode the test data using this probability model and an optimal coder. It is measured in bits.
Perplexity
Encode the text W using -log2 P(W) bits.
Then the cross-entropy H(W) is:
H(W) = -(1/N) log2 P(W)
where N is the length of the text. The perplexity is then defined as:
PP(W) = 2^H(W)
Word Rank vs. Probability
(Log-log plot: word probability, from 1 down to 0.000001, versus word rank, from 1 to 10,000.)
Hmm… there appears to be a relationship between word rank and word probability. Plotting both on log scales, as above, reveals a linear, or straight-line, relationship. How strong is this relationship? (next slide)
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Smoothing: None
p(z | xy) = C(xyz) / C(xy) = C(xyz) / Σ_w C(xyw)
• Called the Maximum Likelihood estimate.
• Lowest perplexity trigram on training data.
• Terrible on test data: if C(xyz) = 0, the probability is 0.
Smoothing: Add One
• What is P(sing | nuts)? Zero? That leads to infinite perplexity!
• Add-one smoothing:
p(z | xy) = (C(xyz) + 1) / (C(xy) + V)
Works very badly. DO NOT DO THIS.
• Add-delta smoothing:
p(z | xy) = (C(xyz) + δ) / (C(xy) + δV)
Still very bad. DO NOT DO THIS.
Smoothing: Simple Interpolation
p(z | xy) = λ C(xyz)/C(xy) + μ C(yz)/C(y) + (1 - λ - μ) C(z)/C(•)
• The trigram is very context-specific, but very noisy
• The unigram is context-independent, but smooth
• Interpolate trigram, bigram, and unigram for the best combination
• Find 0 < λ, μ < 1 by optimizing on "held-out" data
• Almost good enough
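A minimal sketch of simple interpolation as written above: trigram, bigram, and unigram maximum-likelihood estimates mixed with fixed weights. The weights here are hard-coded for illustration; as the next slides explain, they should really be tuned on held-out data.

```python
from collections import Counter

class InterpolatedTrigram:
    def __init__(self, sentences, lam=0.6, mu=0.3):
        self.lam, self.mu = lam, mu
        self.tri, self.bi, self.uni = Counter(), Counter(), Counter()
        self.ctx2, self.ctx1 = Counter(), Counter()
        self.total = 0
        for s in sentences:
            toks = ["<s>", "<s>"] + s.split() + ["</s>"]
            for x, y, z in zip(toks, toks[1:], toks[2:]):
                self.tri[(x, y, z)] += 1   # C(xyz)
                self.ctx2[(x, y)] += 1     # C(xy)
                self.bi[(y, z)] += 1       # C(yz)
                self.ctx1[y] += 1          # C(y)
                self.uni[z] += 1           # C(z)
                self.total += 1            # C(.)

    def prob(self, z, x, y):
        p3 = self.tri[(x, y, z)] / self.ctx2[(x, y)] if self.ctx2[(x, y)] else 0.0
        p2 = self.bi[(y, z)] / self.ctx1[y] if self.ctx1[y] else 0.0
        p1 = self.uni[z] / self.total
        return self.lam * p3 + self.mu * p2 + (1 - self.lam - self.mu) * p1

lm = InterpolatedTrigram(["john read her book", "john read a book by mulan"])
print(lm.prob("book", "read", "a"))    # seen trigram: large
print(lm.prob("mulan", "read", "a"))   # unseen trigram: small but nonzero
```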
Smoothing: Simple Interpolation
• Split data into training, “heldout”, test
• Try lots of different values for  on heldout data,
pick best
• Test on test data
• Sometimes, can use tricks like "EM" (expectation maximization) to find the values
• I prefer to use a generalized search algorithm, "Powell search" – see Numerical Recipes in C
Smoothing: Simple Interpolation
• How much data for training, heldout, test?
• Some people say things like "1/3, 1/3, 1/3" or "80%, 10%, 10%". They are WRONG.
• Heldout should have (at least) 100-1000 words per
parameter.
• Answer: enough test data to be statistically significant. (1000s
of words perhaps)
Smoothing: Simple Interpolation
• Be careful: WSJ data is divided into stories. Some are easy, with lots of numbers and financial terms; others are much harder. Use enough data to cover many stories.
• Be careful: some stories are repeated in the data sets.
• Can take data from the end (better) or randomly from within the training set. Watch for temporal effects like "swine flu".
Smoothing: Jelinek-Mercer
• Simple interpolation:
P_smooth(z | xy) = λ C(xyz)/C(xy) + (1 - λ) P_smooth(z | y)
• Better: smooth a little after "The Dow", lots after "Adobe acquired" – let λ depend on the context count:
P_smooth(z | xy) = λ_{C(xy)} C(xyz)/C(xy) + (1 - λ_{C(xy)}) P_smooth(z | y)
Smoothing: Jelinek-Mercer
P_smooth(z | xy) = λ_{C(xy)} C(xyz)/C(xy) + (1 - λ_{C(xy)}) P_smooth(z | y)
• Put the λs into buckets by count
• Find the λs by cross-validation on held-out data
• Also called "deleted interpolation"
Smoothing: Katz
• Compute the discount using the "Good-Turing" estimate
• Only use the bigram if the trigram is missing
P_Katz(z | xy) = (C(xyz) - D(C(xyz))) / C(xy)   if C(xyz) > 0
              = α(xy) P_Katz(z | y)             otherwise
• Works pretty well, except not good for 1-counts
• α is calculated so the probabilities sum to 1
Smoothing: Interpolated Absolute Discount
• JM and simple interpolation over-discount large counts and under-discount small counts:
λ_{C(xy)} C(xyz)/C(xy) + (1 - λ_{C(xy)}) P_smooth(z | y)
• "San Francisco" occurs 100 times, "San Alex" once: should we use a big discount or a small one?
– Absolute discounting takes the same amount from everyone:
P_absinterp(z | xy) = (C(xyz) - D) / C(xy) + α(xy) P_absinterp(z | y)
Smoothing: Interpolated Multiple Absolute Discounts
• One discount is good:
(C(xyz) - D) / C(xy) + α(xy) P_absinterp(z | y)
• Different discounts for different counts:
(C(xyz) - D(C(xyz))) / C(xy) + α(xy) P_absinterp(z | y)
• Multiple discounts: one for 1-counts, one for 2-counts, one for counts > 2
Smoothing: Kneser-Ney
P(Francisco | eggplant) vs. P(stew | eggplant)
• "Francisco" is common, so backoff and interpolated methods say it is likely
• But it only occurs in the context of "San"
• "Stew" is common, and occurs in many contexts
• Weight the backoff by the number of contexts the word occurs in
Smoothing: Kneser-Ney
• Interpolated Absolute-discount
• Modified backoff distribution
• Consistently best technique
P_KN(z | xy) = (C(xyz) - D(C(xyz))) / C(xy) + α(xy) · |{w : C(wyz) > 0}| / Σ_v |{w : C(wyv) > 0}|
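Below is a rough sketch of interpolated Kneser-Ney, written for bigrams to keep it short: absolute discounting plus a continuation distribution that weights each word by how many distinct contexts it follows (the "Francisco vs. stew" idea above). The tiny corpus, the single discount D = 0.75, and the variable names are illustrative assumptions, not the exact formulation used in the lecture's experiments.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences, D=0.75):
    bigram, unigram = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigram.update(toks[:-1])                 # history counts C(y)
        bigram.update(zip(toks[:-1], toks[1:]))   # C(yz)

    followers = defaultdict(set)   # y -> distinct words seen after y
    contexts = defaultdict(set)    # z -> distinct words seen before z
    for (y, z) in bigram:
        followers[y].add(z)
        contexts[z].add(y)
    n_bigram_types = len(bigram)

    def prob(z, y):
        discounted = max(bigram[(y, z)] - D, 0) / unigram[y]
        backoff_weight = D * len(followers[y]) / unigram[y]
        continuation = len(contexts[z]) / n_bigram_types
        return discounted + backoff_weight * continuation

    return prob

p = kneser_ney_bigram(["san francisco is foggy",
                       "beef stew is tasty",
                       "lamb stew is hearty"])
print(p("francisco", "san"))   # seen bigram: high
print(p("stew", "san"))        # unseen, but "stew" follows varied contexts
print(p("francisco", "beef"))  # unseen, and "francisco" only follows "san": lower
```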
Smoothing: Chart
(perplexity comparison chart for the smoothing techniques above)
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Caching
• If you say something, you are likely to say it again later
• Interpolate the trigram with a cache model:
P(z | history) = λ P_smooth(z | xy) + (1 - λ) P_cache(z | history)
P_cache(z | history) = C(z ∈ history) / length(history)
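A minimal sketch of the cache interpolation above: a unigram cache built from the words seen so far, mixed with any smoothed trigram model. The stand-in base model and the weight λ = 0.9 are illustrative assumptions.

```python
from collections import Counter

def cached_prob(z, x, y, history, smooth_prob, lam=0.9):
    """lam * P_smooth(z | xy) + (1 - lam) * P_cache(z | history)."""
    cache = Counter(history)
    p_cache = cache[z] / len(history) if history else 0.0
    return lam * smooth_prob(z, x, y) + (1 - lam) * p_cache

# Hypothetical usage: the word "truth" was already said earlier in the session.
history = "i swear to tell the truth the whole truth".split()
base = lambda z, x, y: 0.001          # stand-in smoothed trigram probability
print(cached_prob("truth", "the", "whole", history, base))   # boosted by cache
```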
Caching: Real Life
• Someone says “I swear to tell the truth”
• System hears “I swerve to smell the soup”
• Cache remembers!
• Person says “The whole truth”, and, with cache,
system hears “The whole soup.” – errors are
locked in.
• Caching works well when users correct as they go,
poorly or even hurts without correction.
Cache Results
(Chart: perplexity reduction from caching versus training-data size, from 100,000 words to all the data, for unigram, bigram, and trigram caches and their "+ conditional trigram" variants; reductions range from roughly 5% to 40%.)
• Why stop at 3-grams?
• If P(z | …rstuvwxy) ≈ P(z | xy) is good, then P(z | …rstuvwxy) ≈ P(z | vwxy) is better!
• Very important to smooth well
• Interpolated Kneser-Ney works much better than Katz on 5-grams, even more so than on 3-grams
N-gram versus smoothing algorithm
(Chart: test entropy versus n-gram order, 1 to 20, for Katz and Kneser-Ney smoothing with 100,000, 1,000,000, and 10,000,000 words of training data and with all the data. Entropies range from about 10 bits down to below 6 bits; Kneser-Ney is consistently below Katz, and the gap grows with the n-gram order.)
Speech recognizer mechanics
• Keep many hypotheses alive:
"…tell the" (.01), "…smell the" (.01)
• Find acoustic and language-model scores:
– P(acoustics | truth) = .3, P(truth | tell the) = .1
– P(acoustics | soup) = .2, P(soup | smell the) = .01
• Combine them:
"…tell the truth" (.01 × .3 × .1)
"…smell the soup" (.01 × .2 × .01)
Speech recognizer slowdowns
• Speech recognizer uses tricks (dynamic programming) to
merge hypotheses
Trigram: "…tell the", "…smell the"
Fivegram: "…swear to tell the", "…swerve to smell the", "…swear too tell the", "…swerve too smell the", "…swerve to tell the", "…swerve too tell the", …
Speech recognizer vs. n-gram
• Recognizer can threshold out bad hypotheses
• Trigram works so much better than bigram, better
thresholding, no slow-down
• 4-gram, 5-gram start to become expensive
Speech recognizer with language model
• In theory:
argmax_wordsequence P(acoustics | wordsequence) × P(wordsequence)
• In practice, the language model is a better predictor – acoustic probabilities aren't "real" probabilities
• In practice, penalize insertions:
argmax_wordsequence P(acoustics | wordsequence) × P(wordsequence)^8 × 0.1^length(wordsequence)
Skipping
• P(z | …rstuvwxy) ≈ P(z | vwxy)
• Why not P(z | v_xy)? A "skipping" n-gram skips the value of the 3-back word.
• Example: P(time | show John a good) → P(time | show ____ a good)
• P(z | …rstuvwxy) ≈ λ P(z | vwxy) + μ P(z | vw_y) + (1 - λ - μ) P(z | v_xy)
5-gram Skipping Results
(Chart: perplexity reduction, 0–7%, versus training size, 10,000 to 1 billion words, for various 5-gram skipping and rearranging variants such as "vw_y, v_xy, vwx_" skipping and "xvwy, wvxy, yvwx" rearranging.)
(Best trigram skipping result: 11% reduction)
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Clustering
• CLUSTERING = CLASSES (same thing)
• What is P(Tuesday | party on)?
• Similar to P(Monday | party on)
• Similar to P(Tuesday | celebration on)
• Put words into clusters:
– WEEKDAY = Sunday, Monday, Tuesday, …
– EVENT = party, celebration, birthday, …
Clustering overview
• Major topic, useful in many fields
• Kinds of clustering
– Predictive clustering
– Conditional clustering
– IBM-style clustering
• How to get clusters
– Be clever or it takes forever!
Predictive clustering
• Let "z" be a word and "Z" be its cluster
• One cluster per word: hard clustering
– WEEKDAY = Sunday, Monday, Tuesday, …
– MONTH = January, February, April, May, June, …
• P(z | xy) = P(Z | xy) × P(z | xyZ)
• P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
• P_smooth(z | xy) ≈ P_smooth(Z | xy) × P_smooth(z | xyZ)
Predictive clustering example
Find P(Tuesday | party on) = P_smooth(WEEKDAY | party on) × P_smooth(Tuesday | party on WEEKDAY)
C(party on Tuesday) = 0
C(party on Wednesday) = 10
C(arriving on Tuesday) = 10
C(on Tuesday) = 100
P_smooth(WEEKDAY | party on) is high
P_smooth(Tuesday | party on WEEKDAY) backs off to P_smooth(Tuesday | on WEEKDAY)
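A toy sketch of the predictive-clustering factorization P(z | xy) = P(Z | xy) × P(z | xyZ). The hand-built cluster table and the two stand-in component models are illustrative assumptions; in practice both factors are smoothed n-gram estimates, as in the example above.

```python
CLUSTER = {"sunday": "WEEKDAY", "monday": "WEEKDAY", "tuesday": "WEEKDAY",
           "party": "EVENT", "celebration": "EVENT"}

def cluster_prob(z, x, y, p_cluster, p_word_in_cluster):
    Z = CLUSTER.get(z, z)          # unclustered words act as their own cluster
    return p_cluster(Z, x, y) * p_word_in_cluster(z, x, y, Z)

# Stand-in component models (assumptions, for illustration only):
p_cluster = lambda Z, x, y: 0.2 if Z == "WEEKDAY" else 0.01
p_word = lambda z, x, y, Z: 1/7 if Z == "WEEKDAY" else 0.001

# P(Tuesday | party on) = P(WEEKDAY | party on) * P(Tuesday | party on WEEKDAY)
print(cluster_prob("tuesday", "party", "on", p_cluster, p_word))
```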
Cluster Results
(Chart: perplexity reduction over a Kneser-Ney trigram baseline for several clustering schemes – Predict, IBM, Full IBM, and "All Combined" – at training sizes of 100,000, 1,000,000, and 10,000,000 words; results range from roughly -20% to +20%.)
Clustering: how to get them
• Build them by hand
– Works ok when almost no data
• Part of Speech (POS) tags
– Tends not to work as well as automatic
• Automatic Clustering
– Swap words between clusters to minimize perplexity
Clustering: automatic
Minimize perplexity of P(z|Y)
Mathematical tricks speed it up
Use top-down splitting,
not bottom up merging!
Two actual WSJ classes
• Class 1: MONDAYS, FRIDAYS, THURSDAY, MONDAY, EURODOLLARS, SATURDAY, WEDNESDAY, FRIDAY, TENTERHOOKS, TUESDAY, SUNDAY
• Class 2: CONDITION, PARTY, FESCO, CULT, NILSON, PETA, CAMPAIGN, WESTPAC, FORCE, CONRAN, DEPARTMENT, PENH, GUILD
Sentence Mixture Models
• Lots of different sentence types:
– Numbers (The Dow rose one hundred seventy three
points)
– Quotations (Officials said "quote we deny all wrongdoing quote")
– Mergers (AOL and Time Warner, in an attempt to control
the media and the internet, will merge)
• Model each sentence type separately
Sentence Mixture Models
• Roll a die to pick the sentence type s_k, with probability σ_k
• Compute the probability of the sentence given s_k
• Probability of the sentence across types:
P(w_1 … w_n) = Σ_{k=1..m} σ_k Π_{i=1..n} P(w_i | w_{i-2} w_{i-1} s_k)
Sentence Model Smoothing
• Each topic model is smoothed with the overall model
• The sentence mixture model is smoothed with the overall model (sentence type 0):
P(w_1 … w_n) = Σ_{k=0..m} σ_k Π_{i=1..n} [ λ P(w_i | w_{i-2} w_{i-1} s_k) + (1 - λ) P(w_i | w_{i-2} w_{i-1}) ]
Sentence Mixture Results
(Chart: perplexity reduction, 0–20%, versus the number of sentence types, 1 to 128, for 3-gram and 5-gram models trained on 100,000, 1,000,000, and 10,000,000 words and on all the data.)
Sentence Clustering
• Same algorithm as word clustering
• Assign each sentence to a type, sk
• Minimize perplexity of P(z|sk ) instead of P(z|Y)
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Structured Language Model
“The contract ended with a loss of 7 cents after”
Thanks to Ciprian Chelba for this figure
How to get structure data?
• Use a treebank (a collection of sentences with hand-annotated structure), such as the Wall Street Journal portion of the Penn Treebank.
• Problem: need a treebank.
• Or – use a treebank (WSJ) to train a parser; then
parse new training data (e.g. Broadcast News)
• Re-estimate parameters to get lower perplexity
models.
Parsing vs. Trigram
Eugene Charniak's experiments

Model                             Perplexity
Trigram, poor smoothing           167
Trigram, deleted interpolation    155
Trigram, Kneser-Ney               145
Parsing                           119  (an 18% reduction)

All experiments are trained on one million words of Penn Treebank data and tested on 80,000 words.
Thanks to Eugene Charniak for this slide
Structured Language Models
• Promising results
• But: time-consuming; language is right-branching; 5-grams and skipping capture similar information.
• Interesting applications to parsing
– Combines nicely with parsing MT systems
N-best lists
• Make a list of the 100 best translation hypotheses using a simple bigram or trigram model
• Rescore the list using any model you want
– Cheaply apply complex models
– Perform source-model research separately from the channel model
• For long, complex sentences, exponentially many more hypotheses are needed
Lattices for MT
Compact version of n-best list
From Ueffing, Och and Ney,
EMNLP ‘02
Tools: CMU Language Modeling Toolkit
• Can handle bigram, trigrams, more
• Can handle different smoothing schemes
• Many separate tools – output of one tool is input to
next: easy to use
• Free for research purposes
• http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
Tools: SRI Language Modeling Toolkit
• More powerful than CMU toolkit
• Can handle clusters, lattices, n-best lists, and hidden tags
• Free for research use
• http://www.speech.sri.com/projects/srilm
Small enough
• Real language models are often huge
• 5-gram models typically larger than the training
data
• Use count-cutoffs (eliminate parameters with
fewer counts) or, better
• Use Stolcke pruning – it finds the counts that contribute least to perplexity reduction, e.g.
– P(City | New York) ≈ P(City | York)
– P(Friday | God it's) ≈ P(Friday | it's)
• Remember, Kneser-Ney helps most when there are lots of 1-counts
Combining Data
• Often, you have some “in domain” data and some
“out of domain data”
• Example: Microsoft is working on translating
computer manuals
– Only about 3 million words of Brazilian computer
manuals
• Can combine computer manual data with
hundreds of millions of words of other data
– Newspapers, web, encyclopedias, Usenet, …
How to combine
• Just concatenate – add them all together
– Bad idea – need to weight the “in domain” data more heavily
• Take out of domain data and multiple copies of in domain
data (weight the counts)
– Bad idea – doesn’t work well, and messes up most smoothing
techniques
How to combine
• A good way: take a weighted average, e.g.
λ P_manuals(z | xy) + μ P_web(z | xy) + (1 - λ - μ) P_newspaper(z | xy)
• Can apply to channel models too (e.g. combine
Hansard with computer manuals for French
translation)
• Lots of research in other techniques
– Maxent-inspired models, non-linear interpolation (log
domain), cluster models, etc. Minimal improvement (but
see work by Rukmini Iyer)
Other Language Model Uses
• Handwriting recognition
– P(observed ink | words) × P(words)
• Telephone keypad input
– P(numbers | words) × P(words)
• Spelling correction
– P(observed keys | words) × P(words)
• Chinese/Japanese text entry
– P(phonetic representation | characters) × P(characters)
(In each case, the P(words) or P(characters) factor is the language model.)
Some Experiments
• Joshua Goodman re-implemented almost all
techniques
• Trained on 260,000,000 words of WSJ
• Optimize parameters on heldout
• Test on separate test section
• Some combinations extremely time-consuming
(days of CPU time)
– Don’t try this at home, or in anything you want to ship
• Rescored N-best lists to get results
– Maximum possible improvement from 10% word error
rate absolute to 5%
Overall Results: Perplexity
(Scatter plot: perplexity, roughly 70–115, versus word error rate, roughly 8.8–10%, for the techniques compared: Katz and Kneser-Ney baselines, 5-gram, skipping, clustering, sentence-mixture and cache variants, and the combined "all-cache-…" models, which reach the lowest perplexities.)
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
LM types
Language models used in speech recognition can be classified into the following categories:
• Uniform models: the chance that a word occurs is 1/V, where V is the size of the vocabulary
• Finite state machines
• Grammar models: they use context-free grammars
• Stochastic models: they determine the chance of a word based on its preceding words (e.g., n-grams)
CFG
A grammar is defined by G = (V, T, P, S), where:
V is the set of all non-terminal symbols,
T is the set of all terminal symbols,
P is a set of production rules,
S is a special symbol called the start symbol.
Example rules:
S -> NP VP
VP -> VERB NP
NP -> NOUN
NP -> NAME
NOUN -> speech
NAME -> Julie | Ethan
VERB -> loves | chases
CFG
Parsing
• Bottom-up: start with the input sentence and try to reach the start symbol.
• Top-down: start with the start symbol and try to reach the input sentence by applying the appropriate rules. Left recursion is a problem (A -> Aa).
Advantage of bottom-up: consider "What is the weather forecast for this afternoon?"
Many parsing algorithms are available from computer science.
Problem: people don't follow the rules of grammar strictly, especially in spoken language. Creating a grammar that covers all these constructions is infeasible.
probabilistic CFG
A mixture of formal language theory and probabilistic modeling is the PCFG.
If there are m rules with left-hand-side non-terminal A:
A -> γ_1, A -> γ_2, …, A -> γ_m
then the probability of each rule is
P(A -> γ_j | G) = C(A -> γ_j) / Σ_{i=1..m} C(A -> γ_i)
where C denotes the number of times each rule is used.
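The PCFG estimate above is just relative-frequency counting over rule uses. A minimal sketch, with a made-up list of observed rule applications standing in for counts collected from a treebank:

```python
from collections import Counter

# Hypothetical rule uses harvested from parsed sentences (an assumption).
observed_rules = [("NP", "NOUN"), ("NP", "NAME"), ("NP", "NOUN"),
                  ("VP", "VERB NP"), ("VP", "VERB"), ("S", "NP VP")]

rule_counts = Counter(observed_rules)
lhs_counts = Counter(lhs for lhs, _ in observed_rules)

# P(A -> gamma_j | G) = C(A -> gamma_j) / sum_i C(A -> gamma_i)
rule_prob = {(lhs, rhs): c / lhs_counts[lhs]
             for (lhs, rhs), c in rule_counts.items()}

print(rule_prob[("NP", "NOUN")])   # 2/3
print(rule_prob[("NP", "NAME")])   # 1/3
```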
Outline
• Prob theory intro
• Text prediction
• Intro to LM
• Perplexity
• Smoothing
• Caching
• Clustering
• Parsing
• CFG
• Homework
Homework
• Write a program to estimate the Entropy of written text
(Shannon, 1950)
• Input: a text document (you pick it, the larger the better)
• Write a program that predicts the next letter given the past
letters on a different text (make it interactive?)
– Hint: use character ngrams
• Check it does a perfect job on the training text
• Due 5/21