UCSD Ling/CSE 256 Winter 2009


Statistical NLP
Winter 2009
Lecture 5: Unsupervised Learning I
Unsupervised Word Segmentation
Roger Levy
[thanks to Sharon Goldwater for many slides]
Supervised training
• Standard statistical systems use a supervised
paradigm.
[Diagram — Training: labeled training data → (statistics) machine learning system → prediction procedure]
The real story
• Annotating labeled data is labor-intensive!!!
[Diagram — Training: human effort → labeled training data → (statistics) machine learning system → prediction procedure]
The real story (II)
• This also means that moving to a new language,
domain, or even genre can be difficult.
• But unlabeled data is cheap!
• It would be nice to use the unlabeled data directly to
learn the labelings you want in your model.
• Today we’ll look at methods for doing exactly this.
Today’s plan
• We’ll illustrate unsupervised learning with the
“laboratory” (but cognitively hugely important!) task of
word segmentation
• We’ll use Bayesian methods for unsupervised learning
Learning structured models
• Most of the models we’ll look at in this class are
structured
• Tagging
• Parsing
• Role labeling
• Coreference
• The structure is latent
• With raw data, we have to construct models that will
be rewarded for inferring that latent structure
A very simple example
• Suppose that we observe the following counts
A    B    C    D
9    9    1    1
• Suppose we are told that these counts arose from
tossing two coins, each with a different label on each
side
• For example, coin 1 might be A/C, coin 2 B/D
• Suppose further that we are told that the coins are not
extremely unfair
• There is an intuitive solution; how can we learn it?
A very simple example (II)
• Suppose we fully parameterize the model:
A    B    C    D
9    9    1    1
• The MLE for this model is degenerate: it cannot
distinguish which letters should be paired on which coin
• Convince yourself of this!
• We need to specify more constraints on the model
• The general idea would be to place priors on the model
parameters
• An extreme variant: force p1=p2=0.5
A very simple example (III)
• An extreme variant: force p1=p2=0.5
• This forces structure into the model
A    B    C    D
9    9    1    1
• It also makes it easy to visualize the log-likelihood as a
function of the remaining free parameter π
• The intuitive solution is found!
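One quick way to see this numerically (a minimal sketch, not from the lecture): fix p1 = p2 = 0.5, take the pairing that puts A/B on coin 1 and C/D on coin 2 (the other pairings score worse), and evaluate the log-likelihood on a grid of π values. Variable names are mine.

```python
import numpy as np

counts = {"A": 9, "B": 9, "C": 1, "D": 1}

def log_likelihood(pi):
    # Coin 1 (A/B) is chosen with probability pi, coin 2 (C/D) with 1 - pi;
    # both coins are fair, so P(A) = P(B) = pi/2 and P(C) = P(D) = (1 - pi)/2.
    return ((counts["A"] + counts["B"]) * np.log(pi / 2)
            + (counts["C"] + counts["D"]) * np.log((1 - pi) / 2))

grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmax([log_likelihood(p) for p in grid])]
print(f"pi* = {best:.2f}")  # peaks near 0.9: the A/B coin is simply tossed far more often
```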
The EM algorithm
• In the two-coin example, we were able to explore the
likelihood surface exhaustively:
• Enumerating all possible model structures
• Analytically deriving the MLE for each model structure
• Picking the model structure with best MLE
• In general, however, latent structure often makes
direct analysis of the likelihood surface intractable or
impossible
• [mixture of Gaussians example?]
The EM algorithm
• In cases of an unanalyzable likelihood function, we
want to use hill-climbing techniques to find good points
on the likelihood surface
• Some of these fall under the category of iterative
numerical optimization
• In our case, we’ll look at a general-purpose tool that is
guaranteed “not to do bad things”: the Expectation-Maximization
(EM) algorithm
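As a concrete instance of the kind of iterative optimization EM performs (the mixture-of-Gaussians example the previous slide alludes to), here is a minimal, self-contained sketch of EM for a two-component 1-D Gaussian mixture. It is illustrative only; the initialization strategy and toy data are mine.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    # crude initialization (arbitrary choice)
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities r[n, k] = P(component k | x_n, current params)
        dens = (pi / np.sqrt(2 * np.pi * var)
                * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 700)])
print(em_gmm_1d(data))  # recovers mixing weights near (0.3, 0.7) and means near (-2, 3)
```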
Adding a Bayesian Prior
• For model (w, t, θ), try to find the optimal value for θ
using Bayes’ rule:
P( | w)  P( w |  ) P( )
likelihood prior
posterior
• Two standard objective functions are
• Maximum-likelihood estimation (MLE):
 *  argmax P( w |  )

• Maximum a posteriori (MAP) estimation:
 *  argmaxP( w |  ) P( )

Dirichlet priors
• For multinomial distributions, the Dirichlet makes a
natural prior.
A symmetric Dirichlet(β) prior
over θ = (θ1, θ2):
• β > 1: prefer uniform distributions
• β = 1: no preference
• β < 1: prefer sparse (skewed) distributions
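To see the effect of β concretely, one can draw a few 3-class distributions from symmetric Dirichlet priors with different concentrations. A small illustrative sketch; the particular β values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
for beta in (10.0, 1.0, 0.1):
    # three draws of a 3-class multinomial theta from Dirichlet(beta, beta, beta)
    print(f"beta = {beta:>4}:", np.round(rng.dirichlet([beta] * 3, size=3), 2))
# beta > 1: draws are near-uniform; beta < 1: most of the mass lands on one or two classes
```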
MAP estimation with EM
• We have already seen how to do ML estimation with
the Expectation-Maximization Algorithm
• We can also do MAP estimation with the appropriate
type of prior
• MAP estimation affects the M-step of EM
• For example, with a Dirichlet prior, the MAP estimate can
be calculated by treating the prior parameters as
“pseudo-counts”
(Beal 2003)
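For a single multinomial with a symmetric Dirichlet(β) prior, the pseudo-count idea looks roughly like the sketch below (assuming every count_k + β − 1 is non-negative, e.g. β ≥ 1; function name is mine).

```python
import numpy as np

def map_multinomial(counts, beta):
    """MAP estimate under a symmetric Dirichlet(beta) prior:
    theta_k proportional to count_k + beta - 1 (beta - 1 pseudo-counts per class)."""
    unnorm = np.asarray(counts, dtype=float) + beta - 1.0
    return unnorm / unnorm.sum()

print(map_multinomial([9, 9, 1, 1], beta=2.0))  # one pseudo-count added to each class
print(map_multinomial([9, 9, 1, 1], beta=1.0))  # beta = 1 recovers the MLE (relative frequencies)
```

In EM, the counts here would be the expected counts computed in the E-step.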
More than just the MAP
• Why do we want to estimate θ?
• Prediction: estimate P(wn+1|θ).
• Structure recovery: estimate P(t|θ,w).
• To the true Bayesian, the model θ parameters should
really be marginalized out:
• Prediction: estimate P(wn+1|w) = ∫ P(wn+1|θ) P(θ|w) dθ
• Structure: estimate P(t|w) = ∫ P(t|θ,w) P(θ|w) dθ
• We don’t want to choose model parameters if we can
avoid it
Bayesian integration
• When we integrate over the parameters θ, we gain
• Robustness: values of hidden variables will have high
probability over a range of θ.
• Flexibility: allows wider choice of priors, including priors
favoring sparse solutions.
Integration example
Suppose we want to estimate P(t|w) = ∫ P(t|θ,w) P(θ|w) dθ, where
• P(θ|w) is broad
• P(t = 1|θ,w) is peaked
Estimating t based on fixed θ* favors t = 1, but for
many probable values of θ, t = 0 is a better choice.
Sparse distributions
In language learning, sparse distributions are often
preferable (e.g., conditional distributions like P(wi|wi-1))
• Problem: when β < 1, the prior density P(θ) → ∞ as any
θk → 0, regardless of the other θj.
• Solution: instead of fixing θ, integrate it out:
P(w) = ∫ P(w|θ) P(θ) dθ
A new problem
• So far, we have assumed that all models under
consideration have the same number of parameters.
• Topic models: fixed number of topics, fixed number of
words.
• What if we want to compare models of different sizes?
• Standard method: use ML or MAP estimation to find
optimal models using different numbers of parameters,
then compare using model selection (e.g., likelihood ratio
test).
• But… not always practical or even possible.
Word segmentation
• Given a corpus of fluent speech or text (no word
boundaries), we want to identify the words.
whatsthat
thedoggie
yeah
wheresthedoggie
whats that
the doggie
yeah
wheres the doggie
• Early language acquisition task for human infants.
[Illustration: “see the doggie”]
Word segmentation
• Given a corpus of fluent speech or text (no word
boundaries), we want to identify the words.
• Preprocessing step for many Asian languages.
Maximum-likelihood word segmentation
• Consider using a standard n-gram model (Venkataraman, 2001).
P(w1 … wn) = ∏i P(wi)    (unigram model)
• Hypothesis space: all possible segmentations of the input
corpus, with probabilities for each inferred lexical item.
• If model size is unconstrained, the ML solution is trivial.
• Each sentence is memorized as a “word”, with probability
equal to its empirical probability.
• Adding boundaries shifts probability mass to unseen
sentences.
• Explicitly computing and comparing solutions with all
possible sets of words is intractable.
• Any non-trivial solution is due to implicit constraints of
search.
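A toy illustration of the degenerate ML solution (my own three-utterance corpus, not the lecture’s data): treating every utterance as a single “word” already beats a genuine segmentation under the unigram MLE.

```python
from collections import Counter
from math import log

def unigram_loglik(segmented_corpus):
    """Log-likelihood of a segmentation under a unigram model whose word
    probabilities are the MLE (relative frequencies of the hypothesized tokens)."""
    tokens = [w for utt in segmented_corpus for w in utt]
    counts = Counter(tokens)
    n = len(tokens)
    return sum(c * log(c / n) for c in counts.values())

corpus = ["thedog", "thecat", "thedog"]          # unsegmented utterances (toy data)
memorize = [[u] for u in corpus]                 # each whole utterance is one "word"
segment = [["the", "dog"], ["the", "cat"], ["the", "dog"]]

print(unigram_loglik(memorize))  # higher (less negative): the degenerate ML solution
print(unigram_loglik(segment))   # lower: the real segmentation is penalized
```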
A possible solution
• Brent (1999) defines a Bayesian model for word
segmentation.
• Prior favors solutions with fewer and shorter lexical
items.
• Uses approximate online algorithm to search for the
MAP solution.
• Problems:
• Model is difficult to modify or extend in any interesting
way.
• Results are influenced by special-purpose search
algorithm.
Can we do better?
• We would like to develop models that
• Allow comparison between solutions of varying
complexity.
• Use standard search algorithms.
• Can be easily modified and extended to explore the
effects of different assumptions.
• Properties of lexical items (What does a typical word look
like?)
• Properties of word behavior (How often is a word likely to
appear?)
• Using Bayesian techniques, we can!
A new model
• Assume words w = {w1 … wn} are generated as follows:
G ~ DP(α0, P0)
wi | G ~ G
• G is analogous to θ in the BHMM, but has infinite
dimension. As with θ, we integrate it out.
• DP(α0, P0): a Dirichlet process with concentration
parameter α0 and base distribution P0.
Recap: the Dirichlet distribution
• It’ll be useful to briefly revisit the Dirichlet distribution in
order to get to the Dirichlet process
• The k-class Dirichlet distribution is a probability
distribution over k-class multinomial distributions θ = (θ1, …, θk),
with parameters αi:
P(θ) = (1/Z) ∏i θi^(αi − 1)
• The normalizing constant Z is
Z = ∏i Γ(αi) / Γ(∑i αi)
• Symmetric Dirichlet: all αi are set to the same value α
The Dirichlet process
• Our generative model for word segmentation assumes
data (words!) arise from clusters. Defines
• A distribution over the number and size of the clusters.
• A distribution P0 over the parameters describing the
distribution of data in each cluster.
• Clusters = frequencies of different words.
• Cluster parameters = identities of different words.
The Chinese restaurant process
• In the DP, the number of items in each cluster is
defined by the Chinese restaurant process:
• Restaurant has an infinite number of tables, with infinite
seating capacity.
• The table chosen by the ith customer, zi, depends on
the seating arrangement of the previous i − 1 customers:
P(zi = existing table k | z1…zi−1) = nk / (i − 1 + α)
P(zi = new table | z1…zi−1) = α / (i − 1 + α)
where nk is the number of customers already at table k.
• CRP produces a power-law distribution over cluster
sizes.
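A minimal sketch of sampling a seating arrangement from the CRP (function and variable names are mine):

```python
import random

def crp(n_customers, alpha, seed=0):
    """Sample table sizes from the Chinese restaurant process:
    customer i joins existing table k with prob n_k / (i - 1 + alpha),
    or opens a new table with prob alpha / (i - 1 + alpha)."""
    rng = random.Random(seed)
    tables = []                          # tables[k] = number of customers at table k
    for i in range(1, n_customers + 1):
        r = rng.random() * (i - 1 + alpha)
        for k, n_k in enumerate(tables):
            if r < n_k:
                tables[k] += 1
                break
            r -= n_k
        else:
            tables.append(1)             # open a new table
    return tables

print(sorted(crp(1000, alpha=1.0), reverse=True))  # a few large tables, a long tail of tiny ones
```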
The two-stage restaurant
1. Assign data points to clusters (tables) using the CRP.
2. Sample a label for each table from P0 (e.g., see, dog, the, is, cat, on, …).
Alternative view
• Equivalently, words are generated sequentially using a
cache model: previously generated words are more
likely to be generated again.
[Diagram: previously generated words (see, the, dog, is, on, cat, …) sit in a cache and are reused with probability proportional to their frequency]
Unigram model
• DP model yields the following distribution over
words:
P(wi = w | w−i) = (nw + α0 P0(w)) / (i − 1 + α0)
with P0(w = x1…xm) = ∏j P(xj) for characters x1…xm.
• P0 favors shorter lexical items.
• Words are not independent, but are exchangeable: a
unigram model.
P(w1, w2, w3, w4) = P(w2, w4, w1, w3)
• Input corpus contains utterance boundaries. We
assume a geometric distribution on utterance
lengths.
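The predictive distribution above can be sketched directly in code. This is a simplification: P0 here is just a uniform character model and omits the word-length/boundary term of the full model; all names are mine.

```python
from collections import Counter

def p0(word, n_chars=26):
    # uniform character model: longer candidate words get lower base probability
    return (1.0 / n_chars) ** len(word)

def dp_predictive(word, previous_words, alpha0):
    """P(w_i = w | w_-i) = (n_w + alpha0 * P0(w)) / (i - 1 + alpha0)."""
    counts = Counter(previous_words)
    return (counts[word] + alpha0 * p0(word)) / (len(previous_words) + alpha0)

history = ["the", "dog", "the", "cat", "the"]
print(dp_predictive("the", history, alpha0=20.0))     # frequent word: relatively high
print(dp_predictive("doggie", history, alpha0=20.0))  # unseen, longer word: very low
```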
Unigram model, in more detail
• How a corpus comes into being…
• First, a probability distribution over corpus length N:
• Next, a probability distribution for the type identity of
each new word:
• Finally, a probability distribution over the phonological
form of each word type
Advantages of DP language models
• Solutions are sparse, yet grow as data size grows.
• Smaller values of α0 lead to fewer lexical items
generated by P0.
• Models lexical items separately from frequencies.
• Different choices for P0 can infer different kinds of
linguistic structure.
• Amenable to standard search procedures (e.g., Gibbs
sampling).
Gibbs sampling
• Compare pairs of hypotheses differing by a single
word boundary:
whats.that
the.doggie
yeah
wheres.the.doggie
…
whats.that
the.dog.gie
yeah
wheres.the.doggie
…
• Calculate the probabilities of the words that differ,
given current analysis of all other words.
• Sample a hypothesis according to the ratio of
probabilities.
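The core of one such boundary decision can be sketched as below. This is a deliberate simplification: the full sampler also corrects the counts when the candidate words interact with each other and handles utterance boundaries specially. The function names, toy context, and uniform-character P0 are mine.

```python
import random
from collections import Counter

def sample_boundary(left, right, other_words, alpha0, p0, rng=random.Random(0)):
    """Choose between h1 (one word: left+right) and h2 (two words: left, right)
    in proportion to their probability given the current analysis of all other
    words, using the DP predictive distribution."""
    counts = Counter(other_words)
    n = len(other_words)

    def pred(w, c, total):
        return (c[w] + alpha0 * p0(w)) / (total + alpha0)

    p_one = pred(left + right, counts, n)
    counts_after_left = counts.copy()
    counts_after_left[left] += 1                      # condition `right` on `left`
    p_two = pred(left, counts, n) * pred(right, counts_after_left, n + 1)
    return "no boundary" if rng.random() < p_one / (p_one + p_two) else "boundary"

others = ["whats", "that", "yeah", "wheres", "the", "doggie", "the"]
print(sample_boundary("dog", "gie", others, alpha0=20.0,
                      p0=lambda w: (1 / 26) ** len(w)))
```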
Experiments
• Input: same corpus as Brent (1999), Venkataraman
(2001).
• 9790 utterances of transcribed child-directed speech.
• Example input:
youwanttoseethebook
looktheresaboywithhishat
andadoggie
youwanttolookatthis
...
• Using different values of α0, evaluate on a single
sample after 20k iterations.
Example results
youwant to see thebook
look theres aboy with his hat
and adoggie
you wantto lookatthis
lookatthis
havea drink
okay now
whatsthis
whatsthat
whatisit
look canyou take itout
...
Quantitative evaluation
                Boundaries       Word tokens
                Prec     Rec     Prec     Rec
Venk. (2001)    80.6     84.8    67.7     70.2
Brent (1999)    80.3     84.3    67.0     69.4
DP model        92.4     62.2    61.9     47.6
Precision: #correct / #found
Recall: #correct / #true
• Proposed boundaries are more accurate than other
models, but fewer proposals are made.
• Result: lower accuracy on words.
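The boundary precision and recall defined above are easy to make concrete; here is a small sketch (helper names are mine) that scores the boundary positions of a predicted segmentation against the gold standard.

```python
def boundary_positions(words):
    # word-internal boundary offsets of a segmented utterance
    positions, offset = set(), 0
    for w in words[:-1]:
        offset += len(w)
        positions.add(offset)
    return positions

def boundary_precision_recall(gold, predicted):
    true, found = boundary_positions(gold), boundary_positions(predicted)
    correct = len(true & found)
    return (correct / len(found) if found else 1.0,
            correct / len(true) if true else 1.0)

print(boundary_precision_recall(["whats", "that"], ["whats", "that"]))      # (1.0, 1.0)
print(boundary_precision_recall(["the", "doggie"], ["the", "dog", "gie"]))  # (0.5, 1.0): over-segmentation hurts precision
```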
What happened?
• DP model assumes (falsely) that words have the same
probability regardless of context.
P(that) = .024
P(that|whats) = .46
P(that|to) = .0019
• Positing collocations allows the model to capture word-to-word dependencies.
What about other unigram models?
• Venkataraman’s system is based on MLE, so we
know results are due to constraints imposed by
search.
• Brent’s search algorithm also yields a non-optimal solution.
• Our solution has higher probability under his model
than his own solution does.
• On randomly permuted corpus, our system achieves
96% accuracy; Brent gets 81%.
• Any (reasonable) unigram model will undersegment.
• Bottom line: previous results were accidental
properties of search, not systematic properties of
models.
Improving the model
• By incorporating context (using a bigram model),
perhaps we can improve segmentation…
Hierarchical Dirichlet process
1. Generate G, a distribution over words, using DP(α0,
P0).
[Diagram: G assigns probabilities to word types — dog, and, girl, the, red, $, car, …]
Adding bigram dependencies
2. For each word in the data, generate a distribution
over the words that follow it, using DP(α1, G).
[Diagram: each word type (see, dog, the, and, $, …) gets its own distribution over the words that follow it, each drawn from DP(α1, G)]
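One way to see what the hierarchy buys is a rough sketch of the bigram predictive probability: bigram counts are smoothed by a unigram backoff (standing in for G), which in turn is smoothed by the character-level P0. This glosses over the Chinese-restaurant-franchise bookkeeping of the real model; all names and the toy counts below are mine.

```python
from collections import Counter

def hdp_bigram_predictive(word, prev_word, bigrams, unigram_types, alpha0, alpha1, p0):
    """Approximate P(w_i = word | w_{i-1} = prev_word, rest of corpus):
    bigram counts backed off to a unigram level, backed off to P0."""
    # backoff level (approximating G): type counts smoothed by P0
    n_types = sum(unigram_types.values())
    p_backoff = (unigram_types[word] + alpha0 * p0(word)) / (n_types + alpha0)
    # bigram level: counts of (prev_word, word) smoothed by the backoff
    n_prev = sum(c for (w1, _), c in bigrams.items() if w1 == prev_word)
    return (bigrams[(prev_word, word)] + alpha1 * p_backoff) / (n_prev + alpha1)

bigrams = Counter([("whats", "that"), ("wheres", "the"), ("the", "doggie")])
unigram_types = Counter(["whats", "that", "wheres", "the", "doggie"])
p0 = lambda w: (1 / 26) ** len(w)
print(hdp_bigram_predictive("that", "whats", bigrams, unigram_types, 20.0, 10.0, p0))
```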
Example results
you want to see the book
look theres a boy with his hat
and a doggie
you want to lookat this
lookat this
have a drink
okay now
whats this
whats that
whatis it
look canyou take it out
...
Quantitative evaluation
                  Boundaries       Word tokens      Lexicon
                  Prec     Rec     Prec     Rec     Prec     Rec
DP (unigram)      92.4     62.2    61.9     47.6    57.0     57.5
Venk. (bigram)    81.7     82.5    68.1     68.6    54.5     57.0
HDP (bigram)      89.9     83.8    75.7     72.1    63.1     50.3
• With appropriate choice of α and β,
• Boundary precision nearly as good as unigram, recall
much better.
• F-score (harmonic mean of prec and rec) on all three measures
outperforms all previously published models.
Summary: word segmentation
• Our approach to word segmentation using infinite
Bayesian models allowed us to
• Incorporate sensible priors to avoid trivial solutions (à la
MLE).
• Examine the effects of modeling assumptions without
limitations from search.
• Demonstrate the importance of context for word
segmentation.
• Achieve the best published results on this corpus.
Conclusion
• Bayesian methods are a promising way to improve
unsupervised learning and increase coverage of NLP.
• Bayesian integration increases robustness to noisy data.
• Use of priors allows us to prefer sparse solutions (and
explore other possible constraints).
• Infinite (nonparametric) models grow with the size of the
data.
• POS tagging and word segmentation are specific
examples, but similar methods can be applied to a
variety of structure learning tasks.
Posterior inference w/ Gibbs Sampling
• The theory of Markov Chain Monte Carlo sampling
says that if we do this type of resampling for a long
time, we will converge to the true posterior distribution
over labels:
• Initialize the tag sequence however you want
• Iterate through the sequence many times, each time
resampling each tag ti from P(ti | t−i, w)
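A rough sketch of what a single resampling step looks like for the BHMM (collapsed Gibbs over tags, with symmetric Dirichlet(α) transition and Dirichlet(β) emission priors). It omits the correction terms needed when the two transitions touching position i share counts; all helper names and the toy counts are mine.

```python
import random
from collections import defaultdict

def resample_tag(i, tags, words, n_tags, vocab_size, alpha, beta,
                 trans, emit, tag_tot, rng=random.Random(0)):
    """One (simplified) collapsed Gibbs step: resample tags[i] in proportion to
    emission * transition-in * transition-out, with position i's counts removed."""
    w, t_prev, t_next, t_old = words[i], tags[i - 1], tags[i + 1], tags[i]
    # remove position i from the count tables
    trans[(t_prev, t_old)] -= 1
    trans[(t_old, t_next)] -= 1
    emit[(t_old, w)] -= 1
    tag_tot[t_old] -= 1

    weights = []
    for t in range(n_tags):
        p_emit = (emit[(t, w)] + beta) / (tag_tot[t] + vocab_size * beta)
        p_in = (trans[(t_prev, t)] + alpha) / (tag_tot[t_prev] + n_tags * alpha)
        p_out = (trans[(t, t_next)] + alpha) / (tag_tot[t] + n_tags * alpha)
        weights.append(p_emit * p_in * p_out)
    t_new = rng.choices(range(n_tags), weights=weights)[0]

    # add position i back in with the new tag
    trans[(t_prev, t_new)] += 1
    trans[(t_new, t_next)] += 1
    emit[(t_new, w)] += 1
    tag_tot[t_new] += 1
    tags[i] = t_new

# toy usage (counts would normally be built from the full tag sequence)
words, tags = ["the", "dog", "barks"], [0, 1, 2]
trans, emit, tag_tot = defaultdict(int), defaultdict(int), defaultdict(int)
for a, b in zip(tags, tags[1:]):
    trans[(a, b)] += 1
for t, w in zip(tags, words):
    emit[(t, w)] += 1
    tag_tot[t] += 1
resample_tag(1, tags, words, n_tags=3, vocab_size=100, alpha=0.003, beta=1.0,
             trans=trans, emit=emit, tag_tot=tag_tot)
print(tags)
```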
Experiments of Goldwater & Griffiths 2006
• Vary α, β using standard “unsupervised” POS tagging
methodology:
• Tag dictionary lists possible tags for each word (based
on ~1m words of Wall Street Journal corpus).
• Train and test on unlabeled corpus (24,000 words of
WSJ).
• 53.6% of word tokens have multiple possible tags.
• Average number of tags per token = 2.3.
• Compare tagging accuracy to other methods.
• HMM with maximum-likelihood estimation using EM
(MLHMM).
• Conditional Random Field with contrastive estimation
(CRF/CE) (Smith & Eisner, 2005).
Results
Tagging accuracy (%):
MLHMM                            74.7
BHMM (α = 1, β = 1)              83.9
BHMM (best: α = .003, β = 1)     86.8
CRF/CE (best)                    90.1
• Transition hyperparameter α has more effect than
output hyperparameter β.
• Smaller α enforces sparse transition matrix, improves
scores.
• Less effect of β due to more varying output distributions?
• Even uniform priors outperform MLHMM (due to
integration).
Hyperparameter inference
• Selecting hyperparameters based on performance is
problematic.
• Violates unsupervised assumption.
• Time-consuming.
• Bayesian framework allows us to infer values
automatically.
• Add uniform priors over the hyperparameters.
• Resample each hyperparameter after each Gibbs
iteration.
• Results: slightly worse than oracle (84.4% vs. 86.8%),
but still well above MLHMM (74.7%).
Reducing lexical resources
Experiments inspired by Smith & Eisner (2005):
• Collapse 45 treebank tags onto smaller set of 17.
• Create several dictionaries of varying quality.
• Words appearing at least d times in 24k training corpus
are listed in dictionary (d = 1, 2, 3, 5, 10, ∞).
• Words appearing fewer than d times can belong to any
class.
• Since standard accuracy measure requires labeled
classes, we measure using best many-to-one
matching of classes.
Results
• BHMM outperforms MLHMM for all dictionary levels,
more so with smaller dictionaries:
d=
1
2
3
5
10
∞
MLHMM
90.6
78.2
74.7
70.5
65.4
34.7
BHMM
91.7
83.7
80.0
77.1
72.8
63.3
• (Results use hyperparameter inference.)
Clustering results
[Figures: example cluster assignments from BHMM and MLHMM]
• MLHMM groups tokens of the same lexical item
together.
• BHMM clusters are more coherent, more variable in
size. Errors are often sensible (e.g. separating
common nouns/proper nouns, confusing
determiners/adjectives, prepositions/participles).
Summary
• Using Bayesian techniques with a standard model
dramatically improves unsupervised POS tagging.
• Integration over parameters adds robustness to
estimates of hidden variables.
• Use of priors allows preference for sparse distributions
typical of natural language.
• Especially helpful when learning is less constrained.