Introduction to Natural Language Processing (600.465)


Introduction to
Natural Language Processing (600.465)
Language Modeling
(and the Noisy Channel)
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
[email protected]
www.cs.jhu.edu/~hajic
1
The Noisy Channel
• Prototypical case:
Input: 0,1,1,1,0,1,0,1,... → The channel (adds noise) → Output (noisy): 0,1,1,0,0,1,1,0,...
• Model: probability of error (noise):
• Example: p(0|1) = .3 p(1|1) = .7 p(1|0) = .4 p(0|0) = .6
• The Task:
known: the noisy output; want to know: the input (decoding)
2
Noisy Channel Applications
• OCR
– straightforward: text → print (adds noise), scan → image
• Handwriting recognition
– text → neurons, muscles (“noise”), scan/digitize → image
• Speech recognition (dictation, commands, etc.)
– text → conversion to acoustic signal (“noise”) → acoustic waves
• Machine Translation
– text in target language → translation (“noise”) → source language
• Also: Part of Speech Tagging
– sequence of tags → selection of word forms → text
3
Noisy Channel: The Golden Rule of ...
OCR, ASR, HR, MT, ...
• Recall:
p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)
A_best = argmax_A p(B|A) p(A)   (The Golden Rule)
• p(B|A): the acoustic/image/translation/lexical model
– application-specific name
– will explore later
• p(A): the language model
4
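To make the Golden Rule concrete, here is a minimal sketch (not from the lecture) of noisy-channel decoding over an explicit candidate set, using the slide's toy bit-channel probabilities and a deliberately simple uniform language model; the function and variable names are made up for illustration.

```python
# A toy illustration of the Golden Rule: pick the input A that
# maximizes p(B|A) * p(A) over a candidate set.
from itertools import product

def decode(observed, candidates, channel_model, language_model):
    """Return the candidate input that maximizes p(B|A) * p(A)."""
    return max(candidates,
               key=lambda a: channel_model(observed, a) * language_model(a))

# The slide's bit channel: p(output | input)
P_ERR = {('0', '1'): .3, ('1', '1'): .7, ('1', '0'): .4, ('0', '0'): .6}

def channel(observed, source):          # p(B|A), assuming independent positions
    p = 1.0
    for o, s in zip(observed, source):
        p *= P_ERR[(o, s)]
    return p

def prior(source):                      # a (deliberately dumb) uniform language model
    return 0.5 ** len(source)

observed = '0110'
candidates = [''.join(bits) for bits in product('01', repeat=len(observed))]
print(decode(observed, candidates, channel, prior))   # most likely input sequence
```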
Dan Jurafsky
Probabilistic Language Models
• Today’s goal: assign a probability to a sentence
• Machine Translation:
• P(high winds tonite) > P(large winds tonite)
• Spell Correction
• The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen minuets from)
Why?
• Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc., etc.!!
Dan Jurafsky
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:
  P(w5|w1,w2,w3,w4)
• A model that computes either of these,
  P(W) or P(wn|w1,w2…wn-1),
  is called a language model.
• Better: the grammar. But "language model" or LM is standard.
The Perfect Language Model
• Sequence of word forms [forget about tagging for the moment]
• Notation: A ~ W = (w1,w2,w3,...,wd)
• The big (modeling) question:
p(W) = ?
• Well, we know (Bayes/chain rule →):
p(W) = p(w1,w2,w3,...,wd) =
  = p(w1) × p(w2|w1) × p(w3|w1,w2) × ... × p(wd|w1,w2,...,wd-1)
• Not practical (even short W → too many parameters)
7
Markov Chain
• Unlimited memory (cf. previous foil):
– for wi, we know all its predecessors w1,w2,w3,...,wi-1
• Limited memory:
– we disregard “too old” predecessors
– remember only k previous words: wi-k,wi-k+1,...,wi-1
– called “kth order Markov approximation”
• + stationary character (no change over time):
p(W) ≅ ∏i=1..d p(wi|wi-k,wi-k+1,...,wi-1), d = |W|
8
n-gram Language Models
• (n-1)th order Markov approximation → n-gram LM:
p(W) =df ∏i=1..d p(wi|wi-n+1,wi-n+2,...,wi-1)
  (wi: the prediction; wi-n+1,...,wi-1: the history)
• In particular (assume vocabulary |V| = 60k):
  • 0-gram LM: uniform model,  p(w) = 1/|V|        → 1 parameter
  • 1-gram LM: unigram model,  p(w)                → 6×10^4 parameters
  • 2-gram LM: bigram model,   p(wi|wi-1)          → 3.6×10^9 parameters
  • 3-gram LM: trigram model,  p(wi|wi-2,wi-1)     → 2.16×10^14 parameters
9
LM: Observations
• How large n?
– nothing is enough (theoretically)
– but anyway: as much as possible (→ close to “perfect” model)
– empirically: 3
• parameter estimation? (reliability, data availability, storage space, ...)
• 4 is too much: |V| = 60k → 1.296×10^19 parameters
• but: 6-7 would be (almost) ideal (having enough data): in fact, one can
recover the original text from its 7-grams!
• Reliability ~ (1 / Detail) (→ need a compromise)
  (detail = higher-order n-grams)
• For now, keep word forms (no “linguistic” processing)
10
Parameter Estimation
• Parameter: numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
  • get rid of formatting etc. ("text cleaning")
  • define words (separate but include punctuation, call it "word")
  • define sentence boundaries (insert "words" <s> and </s>)
  • letter case: keep, discard, or be smart:
    – name recognition
    – number type identification
    [these are huge problems per se!]
  • numbers: keep, replace by <num>, or be smart (form ~ pronunciation)
11
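As an illustration of the preparation steps above, here is a small Python sketch (not part of the lecture); the tokenization and sentence-splitting heuristics are simplistic stand-ins for real text cleaning.

```python
# A simplistic data-preparation sketch: cleaning, punctuation as separate tokens,
# sentence boundaries <s> <s> ... </s>, numbers replaced by <num>.
import re

def prepare(raw_text, lowercase=True, num_token="<num>"):
    """Return a list of sentences, each a list of tokens with boundary markers."""
    text = re.sub(r"\s+", " ", raw_text).strip()          # crude "text cleaning"
    if lowercase:
        text = text.lower()                               # one possible case policy
    sentences = []
    for sent in re.split(r"(?<=[.!?])\s+", text):         # naive sentence splitting
        toks = re.findall(r"\w+|[^\w\s]", sent)           # punctuation = its own "word"
        toks = [num_token if t.isdigit() else t for t in toks]
        if toks:
            sentences.append(["<s>", "<s>"] + toks + ["</s>"])
    return sentences

print(prepare("He can buy the can of soda. It costs 2 dollars."))
```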
Maximum Likelihood Estimate
• MLE: Relative Frequency...
– ...best predicts the data at hand (the “training data”)
• Trigrams from Training Data T:
– count sequences of three words in T: c3(wi-2,wi-1,wi)
– [NB: notation: just saying that the three words follow each other]
– count sequences of two words in T: c2(wi-1,wi):
  • either use c2(y,z) = Σw c3(y,z,w)
  • or count differently at the beginning (& end) of data!
p(wi|wi-2,wi-1) =est. c3(wi-2,wi-1,wi) / c2(wi-2,wi-1)
12
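A minimal sketch of the trigram MLE above, using the "sum c3 over the third word" option for c2; the names are illustrative only.

```python
# MLE trigram estimate: p(w3|w1,w2) = c3(w1,w2,w3) / c2(w1,w2),
# where c2 is obtained by summing c3 over the third word.
from collections import Counter

def train_trigram_mle(tokens):
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter()
    for (w1, w2, _w3), n in c3.items():                  # c2(y,z) = sum_w c3(y,z,w)
        c2[(w1, w2)] += n
    def p(w3, w1, w2):
        return c3[(w1, w2, w3)] / c2[(w1, w2)] if c2[(w1, w2)] else 0.0
    return p

tokens = "<s> <s> he can buy the can of soda .".split()
p = train_trigram_mle(tokens)
print(p("can", "<s>", "he"))     # 1.0 in this tiny corpus
```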
Character Language Model
• Use individual characters instead of words:
p(W) =df ∏i=1..d p(ci|ci-n+1,ci-n+2,...,ci-1)
• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• Transform cross-entropy between letter- and word-based models:
  HS(pc) = HS(pw) / (avg. # of characters per word in S)
13
LM: an Example
• Training data:
<s> <s> He can buy the can of soda.
– Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125
  p1(can) = .25
– Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5,
  p2(of|can) = .5, p2(the|buy) = 1, ...
– Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1,
  p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
– (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!
14
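The slide's numbers can be reproduced with a few lines of Python; this is just a check of the unigram/bigram MLE and of the per-word cross-entropy on the training data itself.

```python
# Re-deriving the slide's numbers: unigram/bigram MLE and the (training-data)
# cross-entropy in bits per word.
import math
from collections import Counter

tokens = "<s> <s> He can buy the can of soda .".split()
words = tokens[2:]                                   # the 8 tokens used for unigrams

uni = Counter(words)
p1 = {w: c / len(words) for w, c in uni.items()}
print(p1["can"], p1["He"])                           # 0.25, 0.125

big = Counter(zip(tokens[1:], tokens[2:]))           # bigram counts
ctx = Counter(tokens[1:-1])                          # history counts
p2 = lambda w, h: big[(h, w)] / ctx[h]
print(p2("buy", "can"), p2("of", "can"))             # 0.5, 0.5

H1 = -sum(math.log2(p1[w]) for w in words) / len(words)
H2 = -sum(math.log2(p2(w, h)) for h, w in zip(tokens[1:], tokens[2:])) / len(words)
print(round(H1, 2), round(H2, 2))                    # 2.75, 0.25
```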
Dan Jurafsky
Language Modeling Toolkits
• SRILM
• http://www.speech.sri.com/projects/srilm/
Dan Jurafsky
Google N-Gram Release, August 2006
…
Dan Jurafsky
Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Dan Jurafsky
Google Book N-grams
• http://ngrams.googlelabs.com/
Dan Jurafsky
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• Assign higher probability to “real” or “frequently observed” sentences
  than to “ungrammatical” or “rarely observed” sentences
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
• A test set is an unseen dataset that is different from our training set,
totally unused.
• An evaluation metric tells us how well our model does on the test set.
Dan Jurafsky
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
• Put each model in a task
• spelling corrector, speech recognizer, MT system
• Run the task, get an accuracy for A and for B
• How many misspelled words corrected properly
• How many words translated correctly
• Compare accuracy for A and B
Dan Jurafsky
Difficulty of extrinsic (in-vivo) evaluation
of N-gram models
• Extrinsic evaluation
• Time-consuming; can take days or weeks
• So
• Sometimes use intrinsic evaluation: perplexity
• Bad approximation
• unless the test data looks just like the training data
• So generally only useful in pilot experiments
• But is helpful to think about.
Dan Jurafsky
Intuition of Perplexity
• The Shannon Game:
• How well can we predict the next word?
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
• Unigrams are terrible at this game. (Why?)
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
(photo: Claude Shannon)
• A better model of a text
• is one which assigns a higher probability to the word that actually occurs
Dan Jurafsky
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
• Perplexity is the probability of the test set, normalized by the number of words:
  PP(W) = P(w1w2…wN)^(-1/N)
• Chain rule:
  PP(W) = (∏i=1..N 1/P(wi|w1…wi-1))^(1/N)
• For bigrams:
  PP(W) = (∏i=1..N 1/P(wi|wi-1))^(1/N)
Minimizing perplexity is the same as maximizing probability
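A small sketch of the perplexity computation for a bigram model, done in log space; the bigram_prob argument is assumed to be an already-smoothed model (otherwise a single zero probability makes the perplexity infinite).

```python
# Perplexity of a test sequence under a bigram model, computed in log space:
# PP(W) = P(w1..wN)^(-1/N).  bigram_prob must never return zero.
import math

def perplexity(test_tokens, bigram_prob, start="<s>"):
    log_sum, n, prev = 0.0, 0, start
    for w in test_tokens:
        log_sum += math.log2(bigram_prob(w, prev))
        prev = w
        n += 1
    return 2 ** (-log_sum / n)

# a model that assigns 1/10 to every digit gives perplexity 10 (the branching factor)
print(perplexity(list("3141592653"), lambda w, prev: 1 / 10))
```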
Dan Jurafsky
The Shannon Game intuition for perplexity
• From Josh Goodman
• How hard is the task of recognizing digits ‘0,1,2,3,4,5,6,7,8,9’
• Perplexity 10
• How hard is recognizing (30,000) names at Microsoft.
• Perplexity = 30,000
• If a system has to recognize
  • Operator (1 in 4)
  • Sales (1 in 4)
  • Technical Support (1 in 4)
  • 30,000 names (1 in 120,000 each)
• Perplexity is 53
• Perplexity is weighted equivalent branching factor
Dan Jurafsky
Perplexity as branching factor
• Let’s suppose a sentence consisting of random digits
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
  PP(W) = ((1/10)^N)^(-1/N) = 10
Dan Jurafsky
Lower perplexity = better model
• Training 38 million words, test 1.5 million words, WSJ
  N-gram Order    Perplexity
  Unigram         962
  Bigram          170
  Trigram         109
Dan Jurafsky
The wall street journal
LM: an Example
• Training data:
<s> <s> He can buy the can of soda.
– Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125
  p1(can) = .25
– Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5,
  p2(of|can) = .5, p2(the|buy) = 1, ...
– Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1,
  p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
– (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!
28
LM: an Example (The Problem)
• Cross-entropy:
• S = <s> <s> It was the greatest buy of all. (test data)
• Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:
– all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0.
– all bigram probabilities are 0.
– all trigram probabilities are 0.
• We want: to make all probabilities non-zero → data sparseness handling
29
The Zero Problem
• “Raw” n-gram language model estimate:
– necessarily, some zeros
• many: a trigram model → 2.16×10^14 parameters, data ~ 10^9 words
– which are true 0?
• optimal situation: even the least frequent trigram would be seen several
times, in order to distinguish its probability from other trigrams
• optimal situation cannot happen, unfortunately (open question: how
many data would we need?)
– → we don’t know
– we must eliminate the zeros
• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!
30
Why do we need Nonzero Probs?
• To avoid infinite Cross Entropy:
– happens when an event is found in test data which has
not been seen in training data
H(p) = ∞: prevents comparing data with ≥ 0 “errors”
• To make the system more robust
– low count estimates:
• they typically happen for “detailed” but relatively rare
appearances
– high count estimates: reliable but less “detailed”
31
Eliminating the Zero Probabilities:
Smoothing
• Get new p’(w) (same W): almost p(w) but no zeros
• Discount w for (some) p(w) > 0: new p’(w) < p(w)
  Σw∈discounted (p(w) - p’(w)) = D
• Distribute D to all w with p(w) = 0: new p’(w) > p(w)
  – possibly also to other w with low p(w)
• For some w (possibly): p’(w) = p(w)
• Make sure Σw∈W p’(w) = 1
• There are many ways of smoothing
32
Smoothing by Adding 1 (Laplace)
• Simplest but not really usable:
– Predicting words w from a vocabulary V, training data T:
p’(w|h) = (c(h,w) + 1) / (c(h) + |V|)
• for non-conditional distributions: p’(w) = (c(w) + 1) / (|T| + |V|)
– Problem if |V| > c(h) (as is often the case; even >> c(h)!)
• Example:
  Training data: <s> what is it what is small ?     |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 × .125^2 ≈ .001
    p(it is flying.) = .125 × .25 × 0^2 = 0
  • p’(it) = .1, p’(what) = .15, p’(.) = .05
    p’(what is it?) = .15^2 × .1^2 ≈ .0002
    p’(it is flying.) = .1 × .15 × .05^2 ≈ .00004
  (assume word independence!)
33
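A short sketch reproducing the add-1 numbers from the slide (non-conditional case, with the word-independence assumption for sentence probabilities).

```python
# Add-1 smoothing for the non-conditional case: p'(w) = (c(w) + 1) / (|T| + |V|),
# with the slide's training data and vocabulary.
from collections import Counter

T = "<s> what is it what is small ?".split()          # |T| = 8
V = ["what", "is", "it", "small", "?", "<s>", "flying", "birds", "are", "a", "bird", "."]

c = Counter(T)
p_add1 = {w: (c[w] + 1) / (len(T) + len(V)) for w in V}
print(p_add1["it"], p_add1["what"], p_add1["."])      # 0.1, 0.15, 0.05

def sent_prob(sentence):                              # word-independence assumption
    prob = 1.0
    for w in sentence.split():
        prob *= p_add1[w]
    return prob

print(sent_prob("what is it ?"))                      # ~0.0002
print(sent_prob("it is flying ."))                    # ~0.00004, no longer zero
```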
Adding less than 1
• Equally simple:
  – Predicting words w from a vocabulary V, training data T:
    p’(w|h) = (c(h,w) + λ) / (c(h) + λ|V|), λ < 1
  • for non-conditional distributions: p’(w) = (c(w) + λ) / (|T| + λ|V|)
• Example:
  Training data: <s> what is it what is small ?     |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 × .125^2 ≈ .001
    p(it is flying.) = .125 × .25 × 0^2 = 0
  • Use λ = .1:
    p’(it) ≈ .12, p’(what) ≈ .23, p’(.) ≈ .01
    p’(what is it?) = .23^2 × .12^2 ≈ .0007
    p’(it is flying.) = .12 × .23 × .01^2 ≈ .000003
34
Language Modeling
Advanced: Good Turing Smoothing
Reminder: Add-1 (Laplace) Smoothing
  P_Add-1(wi | wi-1) = (c(wi-1,wi) + 1) / (c(wi-1) + V)

More general formulations: Add-k
  P_Add-k(wi | wi-1) = (c(wi-1,wi) + k) / (c(wi-1) + kV)
  P_Add-k(wi | wi-1) = (c(wi-1,wi) + m·(1/V)) / (c(wi-1) + m)

Unigram prior smoothing
  P_UnigramPrior(wi | wi-1) = (c(wi-1,wi) + m·P(wi)) / (c(wi-1) + m)
Advanced smoothing algorithms
• Intuition used by many smoothing algorithms
– Good-Turing
– Kneser-Ney
– Witten-Bell
• Use the count of things we’ve seen once
– to help estimate the count of things we’ve never seen
Notation: Nc = Frequency of frequency c
• Nc = the count of things we’ve seen c times
• Sam I am I am Sam I do not eat
  word counts: I 3, sam 2, am 2, do 1, not 1, eat 1
  N1 = 3, N2 = 2, N3 = 1
40
Good-Turing smoothing intuition
• You are fishing (a scenario from Josh Goodman), and caught:
– 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
• How likely is it that next species is trout?
– 1/18
• How likely is it that next species is new (i.e. catfish or bass)
– Let’s use our estimate of things-we-saw-once to estimate the new things.
– 3/18 (because N1=3)
• Assuming so, how likely is it that next species is trout?
– Must be less than 1/18 – discounted by 3/18!!
– How to estimate?
Good Turing calculations
P*_GT(things with zero frequency) = N1 / N          c* = (c+1) Nc+1 / Nc
• Unseen (bass or catfish)
  – c = 0
  – MLE p = 0/18 = 0
  – P*_GT(unseen) = N1/N = 3/18
• Seen once (trout)
  • c = 1
  • MLE p = 1/18
  • c*(trout) = 2 × N2/N1 = 2 × 1/3 = 2/3
  • P*_GT(trout) = (2/3) / 18 = 1/27
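The fishing example can be checked with a few lines; this sketch only computes the raw frequency-of-frequency counts and the c* formula, without the smoothing of the Nc values that a real implementation needs.

```python
# Good-Turing counts for the fishing example (10 carp, 3 perch, 2 whitefish,
# 1 trout, 1 salmon, 1 eel; N = 18): P*_GT(unseen) = N1/N, c* = (c+1) N_{c+1} / N_c.
from collections import Counter

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                     # 18
Nc = Counter(catch.values())                # frequency of frequencies: N1=3, N2=1, ...

print(Nc[1] / N)                            # 3/18, the mass reserved for unseen species

def c_star(c):
    # returns 0 when N_{c+1} = 0; real implementations smooth the Nc values first
    return (c + 1) * Nc[c + 1] / Nc[c]

print(c_star(1))                            # 2 * N2/N1 = 2/3
print(c_star(1) / N)                        # P*_GT(trout) = (2/3)/18 = 1/27
```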
Ney et al.’s Good Turing Intuition
H. Ney, U. Essen, and R. Kneser, 1995. On the estimation of 'small' probabilities by leaving-one-out.
IEEE Trans. PAMI. 17:12,1202-1212
Held-out words: [figure omitted: counts of held-out words vs. training counts]
43
Ney et al. Good Turing Intuition
(slide from Dan Klein)
– Take each of the c training words out in turn
– c training sets of size c–1, held-out of size 1
– What fraction of held-out words are unseen in training?
• N1/c
– What fraction of held-out words are seen k times in
training?
• (k+1)Nk+1/c
– So in the future we expect (k+1)Nk+1/c of the words to be
those with training count k
– There are Nk words with training count k
– Each should occur with probability:
  • (k+1) Nk+1 / (c · Nk)
– …or expected count:
  k* = (k+1) Nk+1 / Nk
Intuition from leave-one-out validation
[figure omitted: frequency-of-frequency counts N1, …, N3510, N3511, N4416, N4417]
Good-Turing complications
(slide from Dan Klein)
• Problem: what about “the”? (say c=4417)
– For small k, Nk > Nk+1
– For large k, too jumpy, zeros
wreck estimates
– Simple Good-Turing [Gale and
Sampson]: replace empirical Nk
with a best-fit power law once
counts get unreliable
Resulting Good-Turing numbers
• Numbers from Church and Gale (1991)
• 22 million words of AP Newswire
  c* = (c+1) Nc+1 / Nc

  Count c    Good Turing c*
  0          .0000270
  1          0.446
  2          1.26
  3          2.24
  4          3.24
  5          4.22
  6          5.19
  7          6.21
  8          7.24
  9          8.25
Language Modeling
Advanced: Kneser-Ney Smoothing
Resulting Good-Turing numbers
• Numbers from Church and Gale (1991)
• 22 million words of AP Newswire
  c* = (c+1) Nc+1 / Nc
• It sure looks like c* = (c - .75)

  Count c    Good Turing c*
  0          .0000270
  1          0.446
  2          1.26
  3          2.24
  4          3.24
  5          4.22
  6          5.19
  7          6.21
  8          7.24
  9          8.25
Absolute Discounting Interpolation
• Save ourselves some time and just subtract 0.75 (or some d)!
  P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1,wi) - d) / c(wi-1) + λ(wi-1) P(w)
  (discounted bigram + interpolation weight × unigram)
– (Maybe keeping a couple extra values of d for counts 1
and 2)
• But should we really just use the regular unigram
P(w)?
49
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
  – Shannon game: I can’t see without my reading ___________ ?
    (the answer is “glasses”, not “Francisco”)
  – “Francisco” is more common than “glasses”
  – … but “Francisco” always follows “San”
• The unigram is useful exactly when we haven’t seen this bigram!
• Instead of P(w): “How likely is w?”
• Pcontinuation(w): “How likely is w to appear as a novel continuation?”
  – For each word, count the number of bigram types it completes
  – Every bigram type was a novel continuation the first time it was seen
  P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1,w) > 0}|
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation:
  P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1,w) > 0}|
• Normalized by the total number of word bigram types:
  P_CONTINUATION(w) = |{wi-1 : c(wi-1,w) > 0}| / |{(wj-1,wj) : c(wj-1,wj) > 0}|
Kneser-Ney Smoothing III
• Alternative metaphor: the number of word types seen to precede w:
  |{wi-1 : c(wi-1,w) > 0}|
• normalized by the number of word types preceding all words:
  P_CONTINUATION(w) = |{wi-1 : c(wi-1,w) > 0}| / Σw' |{w'i-1 : c(w'i-1,w') > 0}|
• A frequent word (Francisco) occurring in only one context (San) will
have a low continuation probability
Kneser-Ney Smoothing IV
  P_KN(wi | wi-1) = max(c(wi-1,wi) - d, 0) / c(wi-1) + λ(wi-1) P_CONTINUATION(wi)
λ is a normalizing constant: the probability mass we’ve discounted
  λ(wi-1) = (d / c(wi-1)) · |{w : c(wi-1,w) > 0}|
  – d / c(wi-1): the normalized discount
  – |{w : c(wi-1,w) > 0}|: the number of word types that can follow wi-1
    = # of word types we discounted
    = # of times we applied the normalized discount
53
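A minimal sketch of interpolated Kneser-Ney for bigrams, directly following the formulas above; it assumes every history it is asked about was seen in training and ignores the recursive, higher-order case of the next slide.

```python
# Interpolated Kneser-Ney for bigrams, following the formulas above.
# Assumes every queried history occurred in training (no higher-order recursion).
from collections import Counter, defaultdict

def train_kn_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_count = Counter(tokens[:-1])                 # c(w_{i-1})
    followers = defaultdict(set)                         # word types that follow each history
    preceders = defaultdict(set)                         # word types that precede each word
    for (h, w) in bigrams:
        followers[h].add(w)
        preceders[w].add(h)
    n_bigram_types = len(bigrams)

    def p_continuation(w):
        return len(preceders[w]) / n_bigram_types

    def p_kn(w, h):
        discounted = max(bigrams[(h, w)] - d, 0) / history_count[h]
        lam = (d / history_count[h]) * len(followers[h])  # normalized discount × types
        return discounted + lam * p_continuation(w)

    return p_kn

p_kn = train_kn_bigram("<s> he can buy the can of soda . </s>".split())
print(p_kn("of", "can"))      # seen bigram
print(p_kn("buy", "the"))     # unseen bigram, still non-zero via the continuation prob
```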
Kneser-Ney Smoothing: Recursive
formulation
  P_KN(wi | wi-n+1,...,wi-1) =
    max(c_KN(wi-n+1,...,wi) - d, 0) / c_KN(wi-n+1,...,wi-1)
    + λ(wi-n+1,...,wi-1) · P_KN(wi | wi-n+2,...,wi-1)

  c_KN(·) = count(·) for the highest order
          = continuation count(·) for lower orders

Continuation count = number of unique single-word contexts for ·
54
Backoff and Interpolation
• Sometimes it helps to use less context
– Condition on less context for contexts you haven’t
learned much about
• Backoff:
– use trigram if you have good evidence,
– otherwise bigram, otherwise unigram
• Interpolation:
– mix unigram, bigram, trigram
• Interpolation works better
Smoothing by Combination:
Linear Interpolation
• Combine what?
  • distributions of various levels of detail vs. reliability
• n-gram models:
  • use (n-1)-gram, (n-2)-gram, ..., uniform
    (more detail ↔ less reliability)
• Simplest possible combination:
– sum of probabilities, normalize:
• p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
• p’(0|0) = .6, p’(1|0) = .4, p’(0|1) = .7, p’(1|1) = .3
• (p’(0|0) = 0.5p(0|0) + 0.5p(0))
56
Typical n-gram LM Smoothing
• Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):
  p’λ(wi|wi-2,wi-1) = λ3 p3(wi|wi-2,wi-1) +
    λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0/|V|
• Normalize:
  λi > 0, Σi=0..n λi = 1 is sufficient (λ0 = 1 - Σi=1..n λi) (n=3)
• Estimation using MLE:
  – fix the p3, p2, p1 and |V| parameters as estimated from the training data
  – then find such {λi} which minimizes the cross entropy
    (maximizes the probability of the data): -(1/|D|) Σi=1..|D| log2(p’λ(wi|hi))
57
Held-out Data
• What data to use? (to estimate λ)
  – (bad) try the training data T: but we will always get λ3 = 1
    • why? (let piT be an i-gram distribution estimated using relative freq. from T)
    • minimizing HT(p’λ) over a vector λ, p’λ = λ3p3T + λ2p2T + λ1p1T + λ0/|V|
      – remember: HT(p’λ) = H(p3T) + D(p3T||p’λ); (p3T fixed → H(p3T) fixed, best)
      – which p’λ minimizes HT(p’λ)? Obviously, a p’λ for which D(p3T||p’λ) = 0
      – ...and that’s p3T (because D(p||p) = 0, as we know).
      – ...and certainly p’λ = p3T if λ3 = 1 (maybe in some other cases, too).
        (p’λ = 1 × p3T + 0 × p2T + 0 × p1T + 0/|V|)
  – thus: do not use the training data for estimation of λ!
• must hold out part of the training data (heldout data, H):
• ...call the remaining data the (true/raw) training data, T
• the test data S (e.g., for comparison purposes): still different data!
58
The Formulas (for H)
• Repeat: minimizing -(1/|H|) Σi=1..|H| log2(p’λ(wi|hi)) over λ
  p’λ(wi|hi) = p’λ(wi|wi-2,wi-1) = λ3 p3(wi|wi-2,wi-1) +
    λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0/|V|
• “Expected counts (of lambdas)”: j = 0..3
  c(λj) = Σi=1..|H| (λj pj(wi|hi) / p’λ(wi|hi))
• “Next λ”: j = 0..3
  λj,next = c(λj) / Σk=0..3 c(λk)
59
The (Smoothing) EM Algorithm
1. Start with some λ, such that λj > 0 for all j ∈ 0..3.
2. Compute the “Expected Counts” for each λj.
3. Compute the new set of λj, using the “Next λ” formula.
4. Start over at step 2, unless a termination condition is met.
• Termination condition: convergence of λ.
  – Simply set an ε, and finish if |λj - λj,next| < ε for each j (step 3).
• Guaranteed to converge:
follows from Jensen’s inequality, plus a technical proof.
60
Simple Example
• Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
• Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
• Start with λ1 = .5:
  p’λ(b) = .5 × .5 + .5/26 = .27
  p’λ(a) = .5 × .25 + .5/26 = .14
  p’λ(y) = .5 × 0 + .5/26 = .02
  c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
  c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28
  Normalize: λ1,next = .68, λ0,next = .32.
  Repeat from step 2 (recompute p’λ first for efficient computation, then c(λi), ...)
  Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).
61
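The EM re-estimation on this toy example can be written in a few lines; the sketch below reproduces the first iteration's numbers (λ1 ≈ .68, λ0 ≈ .32) and then keeps iterating.

```python
# EM for the interpolation weights on the slide's toy problem:
# unigram p(.) smoothed with a uniform 1/26 distribution, heldout data "b a b y".
p_uni = {"a": .25, "b": .5}
for ch in "cdefghijklmnopqr":
    p_uni[ch] = 1 / 64                       # the rest (s..z) implicitly get 0
UNIFORM = 1 / 26

def em_step(l1, l0, heldout):
    exp1 = exp0 = 0.0
    for w in heldout:
        p1 = p_uni.get(w, 0.0)
        mix = l1 * p1 + l0 * UNIFORM         # p'_lambda(w)
        exp1 += l1 * p1 / mix                # "expected count" of lambda_1
        exp0 += l0 * UNIFORM / mix           # "expected count" of lambda_0
    total = exp1 + exp0
    return exp1 / total, exp0 / total        # "next lambda" (normalized)

l1, l0 = 0.5, 0.5
for step in range(5):
    l1, l0 = em_step(l1, l0, list("baby"))
    print(step + 1, round(l1, 2), round(l0, 2))   # step 1: ~0.68 / 0.32, as on the slide
```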
Some More Technical Hints
• Set V = {all words from training data}.
• You may also consider V = T ∪ H, but it does not make the
coding in any way simpler (in fact, harder).
• But: you must never use the test data for your vocabulary!
• Prepend two “words” in front of all data:
• avoids beginning-of-data problems
• call these index -1 and 0: then the formulas hold exactly
• When cn(w) = 0:
• Assign 0 probability to pn(w|h) where cn-1(h) > 0, but a uniform
probability (1/|V|) to those pn(w|h) where cn-1(h) = 0 [this must
be done both when working on the heldout data during EM, as
well as when computing cross-entropy on the test data!]
62
Introduction to
Natural Language Processing (600.465)
Mutual Information and Word Classes
(class n-gram)
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
[email protected]
www.cs.jhu.edu/~hajic
63
The Problem
• Not enough data
• Language Modeling: we do not see “correct” n-grams
– solution so far: smoothing
• suppose we see:
  – short homework, short assignment, simple homework
• but not:
  – simple assignment
• What happens to our (bigram) LM?
  – p(homework | simple) = high probability
  – p(assignment | simple) = low probability (smoothed with p(assignment))
– They should be much closer!
64
Word Classes
• Observation: similar words behave in a similar way
– trigram LM:
  – in the ... (all nouns/adj);
  – catch a ... (all things which can be caught, incl. their accompanying adjectives);
– trigram LM, conditioning:
  – a ... homework (any attribute of homework: short, simple, late, difficult),
  – ... the woods (any verb that has the woods as an object: walk, cut, save)
– trigram LM, both:
  – a (short, long, difficult, ...) (homework, assignment, task, job, ...)
65
Solution
• Use the Word Classes as the “reliability” measure
• Example: we see
  • short homework, short assignment, simple homework
  – but not:
  • simple assignment
– Cluster into classes:
  • (short, simple) (homework, assignment)
  – covers “simple assignment”, too
• Gaining: realistic estimates for unseen n-grams
• Losing: accuracy (level of detail) within classes
66
The New Model
• Rewrite the n-gram LM using classes:
– Was: [k = 1..n]
  • pk(wi|hi) = c(hi,wi) / c(hi)   [history: (k-1) words]
– Introduce classes:
  pk(wi|hi) = p(wi|ci) · pk(ci|hi)
  • history: classes, too: [for trigram: hi = ci-2,ci-1; for bigram: hi = ci-1]
– Smoothing as usual
  • over pk(wi|hi), where each is defined as above (except the uniform, which stays at 1/|V|)
67
Training Data
• Suppose we already have a mapping:
– r: V → C assigning each word its class (ci = r(wi))
• Expand the training data:
– T = (w1, w2, ..., w|T|) into
– TC = (<w1,r(w1)>, <w2,r(w2)>, ..., <w|T|,r(w|T|)>)
• Effectively, we have two streams of data:
– word stream: w1, w2, ..., w|T|
– class stream: c1, c2, ..., c|T| (def. as ci = r(wi))
• Expand Heldout, Test data too
68
Training the New Model
• As expected, using ML estimates:
– p(wi|ci) = p(wi|r(wi)) = c(wi) / c(r(wi)) = c(wi) / c(ci)
• !!! c(wi,ci) = c(wi) [since ci determined by wi]
– pk(ci|hi):
• p3(ci|hi) = p3(ci|ci-2 ,ci-1) = c(ci-2 ,ci-1,ci) / c(ci-2 ,ci-1)
• p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1)
• p1(ci|hi) = p1(ci) = c(ci) / |T|
• Then smooth as usual
– not the p(wi|ci) nor pk(ci|hi) individually, but the pk(wi|hi)
69
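A small sketch of the class-based bigram model with the MLE estimates above; the word-to-class map r and the toy corpus are made up for illustration.

```python
# A class-based bigram LM with MLE estimates, assuming a given word-to-class map r:
# p(w_i | w_{i-1}) = p(w_i | c_i) * p2(c_i | c_{i-1})
from collections import Counter

def train_class_bigram(tokens, r):
    classes = [r[w] for w in tokens]
    c_word = Counter(tokens)
    c_class = Counter(classes)
    c_class_bigram = Counter(zip(classes, classes[1:]))

    def p(w, prev_w):
        ci, c_prev = r[w], r[prev_w]
        p_w_given_c = c_word[w] / c_class[ci]                   # p(w_i | c_i)
        p_c_given_c = c_class_bigram[(c_prev, ci)] / c_class[c_prev]
        return p_w_given_c * p_c_given_c

    return p

r = {"short": "ADJ", "simple": "ADJ",
     "homework": "NOUN", "assignment": "NOUN", "<s>": "<s>"}
tokens = "<s> short homework <s> short assignment <s> simple homework".split()
p = train_class_bigram(tokens, r)
print(p("assignment", "simple"))   # > 0 although "simple assignment" was never seen
```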
Classes: How To Get Them
• We supposed the classes are given
• Maybe they are in [human] dictionaries, but...
  – dictionaries are incomplete
  – dictionaries are unreliable
  – they do not define classes as an equivalence relation (overlap)
  – they do not define classes suitable for LM
    • small, short... maybe; small and difficult?
• → we have to construct them from data (again...)
70
Creating the Word-to-Class Map
• We will talk about bigrams from now
• Bigram estimate:
  • p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1) = c(r(wi-1),r(wi)) / c(r(wi-1))
• Form of the model (class bigram):
  – just the raw bigram for now:
  • P(T) = Πi=1..|T| p(wi|r(wi)) · p2(r(wi)|r(wi-1))   (p2(c1|c0) =df p(c1))
• Maximize over r (given r → fixed p, p2):
  – define the objective L(r) = 1/|T| Σi=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1)))
  – rbest = argmaxr L(r)   (L(r) = normalized logprob of the training data... as usual)
    (or negative cross entropy)
71
Simplifying the Objective Function
• Start from L(r) = 1/|T| Σi=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1))):
  1/|T| Σi log(p(wi|r(wi)) p(r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =
  1/|T| Σi log(p(wi,r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =
  1/|T| Σi log(p(wi)) + 1/|T| Σi log(p2(r(wi)|r(wi-1)) / p(r(wi))) =
  -H(W) + 1/|T| Σi log(p2(r(wi)|r(wi-1)) p(r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =
  -H(W) + 1/|T| Σi log(p(r(wi),r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =
  -H(W) + Σd,e∈C p(d,e) log(p(d,e) / (p(d) p(e))) =
  -H(W) + I(D,E)
  (event E picks the class adjacent, to the right, of the one picked by D)
• Since H(W) does not depend on r, we ended up with the need to maximize I(D,E).
72
Maximizing Mutual Information
(dependent on the mapping r)
• Result from previous foil:
– Maximizing the probability of data amounts to
maximizing I(D,E), the mutual information of the
adjacent classes.
• Good:
– We know what a MI is, and we know how to maximize.
• Bad:
– There is no way to maximize over so many possible
partitionings: |V|^|V| - no way to test them all.
73
The Greedy Algorithm
• Define a merging operation on the mapping r: V → C:
  – merge: R × C × C → R’ × C’: (r,k,l) → (r’,C’) such that
  – C’ = (C - {k,l}) ∪ {m} (throw out k and l, add a new class m ∉ C)
  – r’(w) = m for w ∈ r⁻¹({k,l}),
          = r(w) otherwise.
• 1. Start with each word in its own class (C = V), r = identity.
• 2. Merge two classes k, l into one, m, such that (k,l) = argmaxk,l Imerge(r,k,l)(D,E).
• 3. Set new (r,C) = merge(r,k,l).
• 4. Repeat 2 and 3 until |C| reaches a predetermined size.
74
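A naive sketch of the greedy algorithm: it recomputes I(D,E) for every tentative merge and keeps the best one. Real implementations (e.g. Brown clustering) update the mutual information incrementally instead; the toy corpus and target size below are illustrative.

```python
# Greedy class merging by mutual information (naive, recompute-everything version).
import math
from collections import Counter

def mutual_information(tokens, r):
    pairs = Counter((r[a], r[b]) for a, b in zip(tokens, tokens[1:]))
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (d, e), c in pairs.items():
        left[d] += c
        right[e] += c
    return sum((c / n) * math.log2((c / n) / ((left[d] / n) * (right[e] / n)))
               for (d, e), c in pairs.items())

def greedy_classes(tokens, target_size):
    r = {w: w for w in set(tokens)}                # start: each word in its own class
    while len(set(r.values())) > target_size:
        classes = sorted(set(r.values()))
        best = None
        for i, k in enumerate(classes):            # try every pair of classes (k, l)
            for l in classes[i + 1:]:
                merged = {w: (k if c == l else c) for w, c in r.items()}
                mi = mutual_information(tokens, merged)
                if best is None or mi > best[0]:
                    best = (mi, merged)
        r = best[1]                                # keep the merge with the highest I(D,E)
    return r

tokens = "<s> short homework <s> short assignment <s> simple homework".split()
print(greedy_classes(tokens, 3))
```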
Word Classes in Applications
• Word Sense Disambiguation: context not seen
[enough(-times)]
• Parsing: verb-subject, verb-object relations
• Speech recognition (acoustic model): need more
instances of [rare(r)] sequences of phonemes
• Machine Translation: translation equivalent
selection [for rare(r) words]
75
Spelling Correction and the Noisy Channel
The Spelling Correction Task
Dan Jurafsky
Applications for spelling correction
Word processing
Phones
Web search
77
Dan Jurafsky
Spelling Tasks
• Spelling Error Detection
• Spelling Error Correction:
• Autocorrect
• hte → the
• Suggest a correction
• Suggestion lists
78
Dan Jurafsky
Types of spelling errors
• Non-word Errors
• graffe → giraffe
• Real-word Errors
• Typographical errors
• three → there
• Cognitive Errors (homophones)
• piece → peace,
• too → two
79
Dan Jurafsky
Rates of spelling errors
• 26%: Web queries (Wang et al. 2003)
• 13%: Retyping, no backspace (Whitelaw et al., English & German)
• 7%: Words corrected retyping on phone-sized organizer
• 2%: Words uncorrected on organizer (Soukoreff & MacKenzie 2003)
• 1-2%: Retyping (Kane and Wobbrock 2007, Gruden et al. 1983)
80
Dan Jurafsky
Non-word spelling errors
• Non-word spelling error detection:
• Any word not in a dictionary is an error
• The larger the dictionary the better
• Non-word spelling error correction:
• Generate candidates: real words that are similar to error
• Choose the one which is best:
• Shortest weighted edit distance
• Highest noisy channel probability
81
Dan Jurafsky
Real word spelling errors
• For each word w, generate candidate set:
• Find candidate words with similar pronunciations
• Find candidate words with similar spelling
• Include w in candidate set
• Choose best candidate
• Noisy Channel
• Classifier
82
Spelling Correction and the Noisy Channel
The Noisy Channel Model of Spelling
Dan Jurafsky
Noisy Channel Intuition
84
Dan Jurafsky
Noisy Channel
• We see an observation x of a misspelled word
• Find the correct word w
ŵ = argmax_{w∈V} P(w | x)
  = argmax_{w∈V} P(x | w) P(w) / P(x)
  = argmax_{w∈V} P(x | w) P(w)
85
Dan Jurafsky
History: Noisy channel for spelling
proposed around 1990
• IBM
• Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991.
Context based spelling correction. Information Processing
and Management, 23(5), 517–522
• AT&T Bell Labs
• Kernighan, Mark D., Kenneth W. Church, and William A. Gale.
1990. A spelling correction program based on a noisy channel
model. Proceedings of COLING 1990, 205-210
Dan Jurafsky
Non-word spelling error example
acress
87
Dan Jurafsky
Candidate generation
• Words with similar spelling
• Small edit distance to error
• Words with similar pronunciation
• Small edit distance of pronunciation to error
88
Dan Jurafsky
Damerau-Levenshtein edit distance
• Minimal edit distance between two strings, where edits are:
• Insertion
• Deletion
• Substitution
• Transposition of two adjacent letters
89
Dan Jurafsky
Words within 1 of acress
90
Error    Candidate     Correct   Error    Type
         Correction    Letter    Letter
acress   actress       t         -        deletion
acress   cress         -         a        insertion
acress   caress        ca        ac       transposition
acress   access        c         r        substitution
acress   across        o         e        substitution
acress   acres         -         s        insertion
acress   acres         -         s        insertion
Dan Jurafsky
Candidate generation
• 80% of errors are within edit distance 1
• Almost all errors within edit distance 2
• Also allow insertion of space or hyphen
• thisidea  this idea
• inlaw  in-law
91
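Candidate generation within edit distance 1 can be written compactly, in the spirit of Peter Norvig's well-known spelling corrector; the tiny vocabulary below is just the slide's candidate list.

```python
# Generating all candidates within edit distance 1 (deletion, transposition,
# substitution, insertion) and filtering them against a vocabulary.
import string

def edits1(word, alphabet=string.ascii_lowercase):
    splits      = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes     = [L + R[1:] for L, R in splits if R]
    transposes  = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts     = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

vocab = {"actress", "cress", "caress", "access", "across", "acres"}
print(edits1("acress") & vocab)   # recovers exactly the slide's candidate list
```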
Dan Jurafsky
Language Model
• Use any of the language modeling algorithms we’ve learned
• Unigram, bigram, trigram
• Web-scale spelling correction (web-scale language modeling)
• Stupid backoff
• “Stupid backoff” (Brants et al. 2007)
• No discounting, just use relative frequencies
92
S(wi | wi-k+1,...,wi-1) =
  count(wi-k+1,...,wi) / count(wi-k+1,...,wi-1)   if count(wi-k+1,...,wi) > 0
  0.4 · S(wi | wi-k+2,...,wi-1)                   otherwise

S(wi) = count(wi) / N
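A small sketch of stupid backoff for up to trigrams; note that the scores are not probabilities (they are not normalized), which is exactly the simplification Brants et al. accept.

```python
# Stupid backoff: relative frequency if the n-gram was seen,
# otherwise 0.4 times the lower-order score.
from collections import Counter

class StupidBackoff:
    def __init__(self, tokens, alpha=0.4):
        self.alpha, self.N = alpha, len(tokens)
        self.counts = Counter()
        for n in (1, 2, 3):
            self.counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def score(self, word, context=()):
        context = tuple(context)[-2:]                    # at most two words of history
        if context:
            if self.counts[context + (word,)] > 0:
                return self.counts[context + (word,)] / self.counts[context]
            return self.alpha * self.score(word, context[1:])   # back off
        return self.counts[(word,)] / self.N             # S(w) = count(w) / N

lm = StupidBackoff("<s> he can buy the can of soda . </s>".split())
print(lm.score("of", ("the", "can")))    # seen trigram: relative frequency
print(lm.score("of", ("a", "can")))      # unseen trigram: 0.4 * S(of | can)
```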
Dan Jurafsky
Unigram Prior probability
Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)

  word       Frequency of word    P(word)
  actress        9,321            .0000230573
  cress            220            .0000005442
  caress           686            .0000016969
  access        37,038            .0000916207
  across       120,844            .0002989314
  acres         12,874            .0000318463
93
Dan Jurafsky
Channel model probability
• Error model probability, Edit probability
• Kernighan, Church, Gale 1990
• Misspelled word x = x1, x2, x3… xm
• Correct word w = w1, w2, w3,…, wn
• P(x|w) = probability of the edit
• (deletion/insertion/substitution/transposition)
94
Dan Jurafsky
Computing error probability: confusion
matrix
del[x,y]:   count(xy typed as x)
ins[x,y]:   count(x typed as xy)
sub[x,y]:   count(x typed as y)
trans[x,y]: count(xy typed as yx)
Insertion and deletion conditioned on previous character
95
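A sketch in the style of this channel model: P(x|w) for one single-character edit is read off the confusion matrices, normalized by corpus counts of the intended character or character pair. All count tables below are invented for illustration; the keys follow the x|w notation of the tables that follow (what was typed, given what was intended).

```python
# Toy channel model: single-edit probabilities from confusion matrices.
del_counts   = {("c", "ct"): 25}    # "ct" intended, only "c" typed        (c|ct)
ins_counts   = {("a", "#"): 10}     # extra "a" typed after word start "#" (a|#)
sub_counts   = {("r", "c"): 5}      # "c" intended, "r" typed              (r|c)
trans_counts = {("ac", "ca"): 3}    # "ca" intended, "ac" typed            (ac|ca)
char_counts  = {"c": 50_000, "#": 80_000}    # corpus counts of intended characters
pair_counts  = {"ct": 6_000, "ca": 4_000}    # corpus counts of intended character pairs

def p_edit(x_part, w_part, edit_type):
    """P(x|w) for one edit (deletion / insertion / substitution / transposition)."""
    if edit_type == "deletion":
        return del_counts.get((x_part, w_part), 0) / pair_counts.get(w_part, 1)
    if edit_type == "insertion":
        return ins_counts.get((x_part, w_part), 0) / char_counts.get(w_part, 1)
    if edit_type == "substitution":
        return sub_counts.get((x_part, w_part), 0) / char_counts.get(w_part, 1)
    if edit_type == "transposition":
        return trans_counts.get((x_part, w_part), 0) / pair_counts.get(w_part, 1)
    raise ValueError(edit_type)

# e.g. the actress/acress row below: a deletion with x|w = c|ct
print(p_edit("c", "ct", "deletion"))
```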
Dan Jurafsky
Confusion matrix for spelling errors
Dan Jurafsky
Generating the confusion matrix
• Peter Norvig’s list of errors
• Peter Norvig’s list of counts of single-edit errors
97
Dan Jurafsky
Channel model
98
Kernighan, Church, Gale 1990
Dan Jurafsky
Channel model for acress
Candidate    Correct   Error    x|w     P(x|word)
Correction   Letter    Letter
actress      t         -        c|ct    .000117
cress        -         a        a|#     .00000144
caress       ca        ac       ac|ca   .00000164
access       c         r        r|c     .000000209
across       o         e        e|o     .0000093
acres        -         s        es|e    .0000321
acres        -         s        ss|s    .0000342
99
Dan Jurafsky
Noisy channel probability for acress
100
Candidate    Correct   Error    x|w     P(x|word)     P(word)       10^9 × P(x|w)P(w)
Correction   Letter    Letter
actress      t         -        c|ct    .000117       .0000231      2.7
cress        -         a        a|#     .00000144     .000000544    .00078
caress       ca        ac       ac|ca   .00000164     .00000170     .0028
access       c         r        r|c     .000000209    .0000916      .019
across       o         e        e|o     .0000093      .000299       2.8
acres        -         s        es|e    .0000321      .0000318      1.0
acres        -         s        ss|s    .0000342      .0000318      1.0
Dan Jurafsky
Using a bigram language model
• “a stellar and versatile acress whose
combination of sass and glamour…”
• Counts from the Corpus of Contemporary American English with
add-1 smoothing
• P(actress|versatile)=.000021 P(whose|actress) = .0010
• P(across|versatile) =.000021 P(whose|across) = .000006
• P(“versatile actress whose”) = .000021*.0010 = 210 x10-10
• P(“versatile across whose”) = .000021*.000006 = 1 x10-10
101
Dan Jurafsky
Evaluation
• Some spelling error test sets
  • Wikipedia’s list of common English misspellings
  • Aspell filtered version of that list
  • Birkbeck spelling error corpus
  • Peter Norvig’s list of errors (includes Wikipedia and Birkbeck, for training or testing)
103
Spelling Correction and the Noisy Channel
Real-Word Spelling Correction
Dan Jurafsky
Real-word spelling errors
• …leaving in about fifteen minuets to go to her house.
• The design an construction of the system…
• Can they lave him my messages?
• The study was conducted mainly be John Black.
• 25-40% of spelling errors are real words (Kukich 1992)
105
Dan Jurafsky
Solving real-word spelling errors
• For each word in sentence
• Generate candidate set
• the word itself
• all single-letter edits that are English words
• words that are homophones
• Choose best candidates
• Noisy channel model
• Task-specific classifier
106
Dan Jurafsky
Noisy channel for real-word spell correction
• Given a sentence w1,w2,w3,…,wn
• Generate a set of candidates for each word wi
• Candidate(w1) = {w1, w’1 , w’’1 , w’’’1 ,…}
• Candidate(w2) = {w2, w’2 , w’’2 , w’’’2 ,…}
• Candidate(wn) = {wn, w’n , w’’n , w’’’n ,…}
• Choose the sequence W that maximizes P(W)
Dan Jurafsky
Noisy channel for real-word spell correction
[Candidate lattice for the observed sentence "two of thew":
  two  → {two, to, tao, too, ...}
  of   → {of, off, on, ...}
  thew → {thew, threw, thaw, the, ...}]
108
Dan Jurafsky
Noisy channel for real-word spell correction
[The same candidate lattice; the model scores whole paths through it and chooses the sequence W that maximizes P(W).]
109
Dan Jurafsky
Simplification: One error per sentence
• Out of all possible sentences with one word replaced
  • w1, w''2, w3, w4      two off thew
  • w1, w2, w'3, w4       two of the
  • w'''1, w2, w3, w4     too of thew
  • …
• Choose the sequence W that maximizes P(W)
Dan Jurafsky
Where to get the probabilities
• Language model
• Unigram
• Bigram
• Etc
• Channel model
• Same as for non-word spelling correction
• Plus need probability for no error, P(w|w)
111
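A rough sketch of real-word correction under the one-error-per-sentence simplification from the earlier slide, with a constant probability for "no error" and hypothetical stand-in candidate, channel, and language models.

```python
# Real-word correction, at most one error per sentence: score the unchanged
# sentence and every single-word replacement, keeping P(w|w) for unchanged words.

def correct_sentence(words, candidates, channel_prob, lm_prob, p_no_error=0.95):
    best_sent = words
    best_score = lm_prob(words) * (p_no_error ** len(words))
    for i, w in enumerate(words):
        for cand in candidates(w):
            if cand == w:
                continue
            variant = words[:i] + [cand] + words[i + 1:]
            score = (lm_prob(variant)
                     * channel_prob(w, cand)              # P(typed w | intended cand)
                     * p_no_error ** (len(words) - 1))    # all other words unchanged
            if score > best_score:
                best_sent, best_score = variant, score
    return best_sent

cands = lambda w: {"thew": ["the", "thaw", "threw"]}.get(w, [])
chan  = lambda x, w: 0.001                                # flat toy channel probability
lm    = lambda ws: {"two of the": 1e-6, "two of thew": 1e-10,
                    "two of thaw": 1e-9, "two of threw": 1e-9}.get(" ".join(ws), 1e-12)
print(correct_sentence("two of thew".split(), cands, chan, lm))   # ['two', 'of', 'the']
```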
Dan Jurafsky
Probability of no error
• What is the channel probability for a correctly typed word?
• P(“the”|“the”)
• Obviously this depends on the application
• .90 (1 error in 10 words)
• .95 (1 error in 20 words)
• .99 (1 error in 100 words)
• .995 (1 error in 200 words)
112
Dan Jurafsky
Peter Norvig’s “thew” example
x      w       x|w     P(x|w)      P(w)          10^9 × P(x|w)P(w)
thew   the     ew|e    0.000007    0.02          144
thew   thew            0.95        0.00000009    90
thew   thaw    e|a     0.001       0.0000007     0.7
thew   threw   h|hr    0.000008    0.000004      0.03
thew   thwe    ew|we   0.000003    0.00000004    0.0001
113
Spelling Correction and the Noisy Channel
State-of-the-art Systems
Dan Jurafsky
HCI issues in spelling
• If very confident in correction
• Autocorrect
• Less confident
• Give the best correction
• Less confident
• Give a correction list
• Unconfident
• Just flag as an error
115
Dan Jurafsky
State of the art noisy channel
• We never just multiply the prior and the error model
• Independence assumptions → probabilities not commensurate
• Instead: weigh them:
  ŵ = argmax_{w∈V} P(x|w) P(w)^λ
• Learn λ from a development test set
116
Dan Jurafsky
Phonetic error model
• Metaphone, used in GNU aspell
• Convert misspelling to metaphone pronunciation
  • “Drop duplicate adjacent letters, except for C.”
  • “If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.”
  • “Drop 'B' if after 'M' and if it is at the end of the word”
  • …
• Find words whose pronunciation is 1-2 edit distance from misspelling’s
• Score result list
• Weighted edit distance of candidate to misspelling
• Edit distance of candidate pronunciation to misspelling pronunciation
117
Dan Jurafsky
Improvements to channel model
• Allow richer edits (Brill and Moore 2000)
• ent → ant
• ph → f
• le → al
• Incorporate pronunciation into channel (Toutanova and Moore
2002)
118
Dan Jurafsky
Channel model
• Factors that could influence p(misspelling|word)
• The source letter
• The target letter
• Surrounding letters
• The position in the word
• Nearby keys on the keyboard
• Homology on the keyboard
• Pronunciations
• Likely morpheme transformations
119
Dan Jurafsky
Nearby keys
Dan Jurafsky
Classifier-based methods
for real-word spelling correction
• Instead of just channel model and language model
• Use many features in a classifier such as MaxEnt, CRF.
• Build a classifier for a specific pair like:
whether/weather
• “cloudy” within +- 10 words
• ___ to VERB
• ___ or not
121