Stemming, tagging and chunking


Tagging – more details
D Jurafsky & J H Martin (2000) Speech and Language Processing, Ch 8
R Dale et al (2000) Handbook of Natural Language Processing, Ch 17
C D Manning & H Schütze (1999) Foundations of Statistical Natural Language Processing, Ch 10
POS tagging - overview
• What is a “tagger”?
• Tagsets
• How to build a tagger and how a tagger works
– Supervised vs unsupervised learning
– Rule-based vs stochastic
– And some details
What is a tagger?
• Lack of distinction between …
– Software which allows you to create something you
can then use to tag input text, e.g. “Brill’s tagger”
– The result of running such software, e.g. a tagger for
English (based on the such-and-such corpus)
• Taggers (even rule-based ones) are almost
invariably trained on a given corpus
• “Tagging” usually understood to mean “POS
tagging”, but you can have other types of tags
(eg semantic tags)
Tagging vs. parsing
• Once tagger is “trained”, process consists of
straightforward look-up, plus local context
(and sometimes morphology)
• Will attempt to assign a tag to unknown
words, and to disambiguate homographs
• “Tagset” (list of categories) usually larger
with more distinctions
• Parsing usually has basic word-categories,
whereas tagging makes more subtle
distinctions
• E.g. noun sg vs pl vs genitive, common vs
proper, +is, +has, … and all combinations
• Parser uses maybe 12-20 categories,
tagger may use 60-100
Simple taggers
• Default tagger has one tag per word, and
assigns it on the basis of dictionary lookup
– Tags may indicate ambiguity but not resolve it, e.g.
nvb for noun-or-verb
• Words may be assigned different tags with
associated probabilities
– Tagger will assign most probable tag unless
– there is some way to identify when a less probable
tag is in fact correct
• Tag sequences may be defined by regular
expressions, and assigned probabilities
(including 0 for illegal sequences)
What probabilities do we have to learn?
(a)Individual word probabilities:
Probability that a given tag t is appropriate
for a given word w
– Easy (in principle): learn from training:
$P(t \mid w) = \dfrac{f(t, w)}{f(w)}$
– Problem of “sparse data”:
• Add a small amount to each calculation, so we
get no zeros
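A minimal Python sketch of this estimate, assuming a tiny hand-tagged sample. The data, the function names and the constant `alpha` (the "small amount" added to each count) are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

# Hypothetical hand-tagged training data: (word, tag) pairs.
training = [
    ("the", "DET"), ("can", "NOUN"), ("can", "VERB"), ("can", "VERB"),
    ("rusts", "VERB"), ("the", "DET"), ("dog", "NOUN"),
]
tagset = sorted({t for _, t in training})

pair_freq = Counter(training)                 # f(t, w), keyed as (word, tag)
word_freq = Counter(w for w, _ in training)   # f(w)

def tag_probability(tag, word, alpha=0.0):
    """P(t | w) = f(t, w) / f(w), with alpha added to every count.

    With alpha = 0 the word must have been seen in training.
    """
    num = pair_freq[(word, tag)] + alpha
    den = word_freq[word] + alpha * len(tagset)
    return num / den

def most_probable_tag(word, alpha=0.1):
    """Pick the tag with the highest (smoothed) probability for the word."""
    return max(tagset, key=lambda t: tag_probability(t, word, alpha))
```

Without smoothing, `tag_probability("DET", "can")` is exactly zero; with any positive `alpha` it becomes small but non-zero, and the probabilities over the tagset still sum to one.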
(b) Tag sequence probability:
Probability that a given tag sequence
t1,t2,…,tn is appropriate for a given word
sequence w1,w2,…,wn
– P(t1,t2,…,tn | w1,w2,…,wn ) = ???
– Too hard to calculate entire sequence:
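$P(t_1, t_2, t_3, t_4, \ldots) = P(t_1) \times P(t_2 \mid t_1) \times P(t_3 \mid t_1, t_2) \times P(t_4 \mid t_1, t_2, t_3) \times \ldots$
– Subsequence is more tractable
– Sequence of 2 or 3 should be enough:
Bigram model: $P(t_1, t_2) = P(t_1) \times P(t_2 \mid t_1)$
Trigram model: $P(t_1, t_2, t_3) \approx P(t_1) \times P(t_2 \mid t_1) \times P(t_3 \mid t_2)$
N-gram model: $P(t_1, \ldots, t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$, taking $P(t_1 \mid t_0)$ to be $P(t_1)$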
More complex taggers
• Bigram taggers assign tags on the basis of
sequences of two words (usually assigning
tag to word $w_n$ on the basis of word $w_{n-1}$)
• An nth-order tagger assigns tags on the
basis of sequences of n words
• As the value of n increases, so does the
complexity of the statistical calculation
involved in comparing probabilities
[Figure: POS tagging approaches and corpora. Recoverable labels: Brown Corpus created (EN-US, 1 million words); Greene and Rubin, rule based, 70%; HMM tagging; LOB Corpus created (EN-UK, 1 million words); efficient HMM, sparse data; trigram tagger; tree-based statistics (Helmut Schmid), rule based, 96%+; transformation-based tagging (Eric Brill), rule based, 95%+; combined methods; neural network; POS tagging separated from other NLP; Penn Treebank (WSJ, 4.5M); British National Corpus (tagged by CLAWS).]
How do they work?
Tagger must be “trained”
Many different techniques, but typically …
Small “training corpus” hand-tagged
Tagging rules learned automatically
Rules define most likely sequence of tags
Rules based on
– Internal evidence (morphology)
– External evidence (context)
Rule-based taggers
• Earliest type of tagging: two stages
• Stage 1: look up word in lexicon to give list
of potential POSs
• Stage 2: Apply rules which certify or
disallow tag sequences
• Rules originally handwritten; more recently
Machine Learning methods can be used
• cf transformation-based learning, below
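The two stages can be sketched as below. The lexicon, the set of disallowed tag pairs and the function name are hypothetical; a real rule-based tagger would use far richer rules and would not enumerate every combination.

```python
from itertools import product

# Hypothetical lexicon: each word maps to its potential POS tags.
LEXICON = {
    "the": ["DET"],
    "can": ["NOUN", "VERB", "MODAL"],
    "rusts": ["VERB", "NOUN"],
}
# Hypothetical handwritten rules: tag pairs that may not be adjacent.
DISALLOWED = {("DET", "VERB"), ("DET", "MODAL"), ("NOUN", "NOUN")}

def candidate_taggings(words):
    """Stage 1: look each word up. Stage 2: drop sequences a rule disallows."""
    options = [LEXICON[w] for w in words]
    survivors = []
    for tags in product(*options):
        pairs = zip(tags, tags[1:])
        if not any(pair in DISALLOWED for pair in pairs):
            survivors.append(list(tags))
    return survivors
```

For "the can rusts" the lexicon alone allows six tag sequences; the three rules leave only DET NOUN VERB.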
Stochastic taggers
• Nowadays, pretty much all taggers are
statistics-based and have been since the
1980s (or even earlier: some primitive
algorithms were already published in the
1960s and 1970s)
• Most common is based on Hidden Markov
Models (also found in speech processing)
(Hidden) Markov Models
• Probability calculations imply Markov models: we
assume that P(t|w) is dependent only on the (or, a
sequence of) previous word(s)
• (Informally) Markov models are the class of probabilistic
models that assume we can predict the future without
taking too much account of the past
• Markov chains can be modelled by finite state automata:
the next state in a Markov chain is always dependent on
some finite history of previous states
• Model is “hidden” if it is actually a succession of Markov
models, whose intermediate states are of no interest
Three stages of HMM training
• Estimating likelihoods on the basis of a
corpus: Forward-backward algorithm
• “Decoding”: applying the process to a
given input: Viterbi algorithm
• Learning (training): Baum-Welch algorithm
or Iterative Viterbi
Forward-backward algorithm
• Denote $A_t(s) = P(w_1 \ldots w_t, \text{state}_t = s)$
• Claim: $A_{t+1}(s) = \sum_q A_t(q)\, P(s \mid q)\, P(w_{t+1} \mid s)$
• Therefore we can calculate all $A_t(s)$ in time
• Similarly, by going backwards, we can get:
$B_t(s) = P(w_{t+1} \ldots w_L \mid \text{state}_t = s)$
• Multiplying we can get: $A_t(s)\, B_t(s) = P(w_1 \ldots w_L, \text{state}_t = s)$
• Note that summing this for all states at a time t
gives the likelihood of w1…wL.
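A small sketch of the forward and backward recursions, assuming a made-up two-state HMM. The state names, the probabilities and the function names are illustrative, and time is counted from 0 in the code rather than from 1.

```python
# Hypothetical two-state HMM; all numbers are made up for illustration.
STATES = ["N", "V"]
INIT = {"N": 0.7, "V": 0.3}                 # P(state at first position = s)
TRANS = {"N": {"N": 0.4, "V": 0.6},         # TRANS[q][s] = P(s | q)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.6, "swim": 0.4},    # EMIT[s][w] = P(w | s)
        "V": {"fish": 0.3, "swim": 0.7}}

def forward(words):
    """A[t][s] = P(words up to and including t, state at t = s)."""
    A = [{s: INIT[s] * EMIT[s][words[0]] for s in STATES}]
    for w in words[1:]:
        prev = A[-1]
        A.append({s: sum(prev[q] * TRANS[q][s] for q in STATES) * EMIT[s][w]
                  for s in STATES})
    return A

def backward(words):
    """B[t][s] = P(words after t | state at t = s)."""
    B = [{s: 1.0 for s in STATES}]
    for w in reversed(words[1:]):
        nxt = B[0]
        B.insert(0, {s: sum(TRANS[s][q] * EMIT[q][w] * nxt[q] for q in STATES)
                     for s in STATES})
    return B

def likelihood_at(words, t):
    """Sum over states of A[t][s] * B[t][s]; the same value for every t."""
    A, B = forward(words), backward(words)
    return sum(A[t][s] * B[t][s] for s in STATES)
```

Calling `likelihood_at(words, t)` for different positions `t` should give the same number each time, since every choice of `t` sums over all state sequences for the whole sentence.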
Viterbi algorithm
(aka Dynamic programming)
(see J&M p177ff)
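• Denote $Q_t(s)$ = best state sequence ending with state s at time t
• Claim: $Q_{t+1}(s)_{1 \ldots t} = Q_t\big(Q_{t+1}(s)_t\big)$, i.e. the first t states of
the best sequence ending in s at time t+1 are themselves the best sequence
ending in its t-th state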
• Otherwise, appending s to the prefix would get a
path better than Qt+1(s).
• Therefore, checking all possible states q at time t,
multiplying by the transition probability between q and
s and the expression probability of wt+1 given s, and
finding the maximum, gives Qt+1(s).
• We need to store for each state the previous state in $Q_t(s)$
• Find the maximal finish state, and reconstruct the path
• $O(L \cdot T^n)$ instead of $T^L$.
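The recursion and the back-pointer bookkeeping can be sketched like this for a bigram (first-order) model. The toy HMM and all names are assumptions for illustration.

```python
# Hypothetical two-state HMM; all numbers are made up for illustration.
STATES = ["N", "V"]
INIT = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.4, "V": 0.6},         # TRANS[q][s] = P(s | q)
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.6, "swim": 0.4},    # EMIT[s][w] = P(w | s)
        "V": {"fish": 0.3, "swim": 0.7}}

def viterbi(words):
    """Return the most probable state sequence and its probability."""
    # best[t][s]: probability of the best sequence ending in s at position t
    best = [{s: INIT[s] * EMIT[s][words[0]] for s in STATES}]
    back = []  # back[t-1][s]: previous state on the best sequence into s at t
    for w in words[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            # Check every possible previous state q and keep the best one.
            q_best = max(STATES, key=lambda q: best[-1][q] * TRANS[q][s])
            scores[s] = best[-1][q_best] * TRANS[q_best][s] * EMIT[s][w]
            pointers[s] = q_best
        best.append(scores)
        back.append(pointers)
    # Find the maximal finish state, then follow the stored pointers back.
    last = max(STATES, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.insert(0, pointers[path[0]])
    return path, best[-1][last]
```

Each position costs one pass over all pairs of states, so the work grows linearly with sentence length rather than exponentially as it would if every tag sequence were scored separately.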
Baum-Welch algorithm
• Start with initial HMM
• Calculate, using F-B, the likelihood to get
our observations given that a certain
hidden state was used at time i.
• Re-estimate the HMM parameters
• Continue until convergence
• Can be shown to improve the likelihood of the observations at every
iteration (or leave it unchanged)
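One way to sketch the loop above for a single observation sequence is shown below. The starting HMM, the observation symbols and every function name are assumptions. A practical implementation would also scale or use logarithms to avoid underflow and would train on many sequences.

```python
def forward(obs, states, init, trans, emit):
    """A[t][s] = P(observations up to t, state at t = s)."""
    A = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        A.append({s: sum(A[-1][q] * trans[q][s] for q in states) * emit[s][o]
                  for s in states})
    return A

def backward(obs, states, trans, emit):
    """B[t][s] = P(observations after t | state at t = s)."""
    B = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        B.insert(0, {s: sum(trans[s][q] * emit[q][o] * B[0][q] for q in states)
                     for s in states})
    return B

def reestimate(obs, states, vocab, init, trans, emit):
    """One re-estimation step; also returns the likelihood under the old model."""
    A = forward(obs, states, init, trans, emit)
    B = backward(obs, states, trans, emit)
    total = sum(A[-1][s] for s in states)  # P(observations)
    n = len(obs)
    # gamma[t][s]: probability that hidden state s was used at time t
    gamma = [{s: A[t][s] * B[t][s] / total for s in states} for t in range(n)]
    # xi[t][q][s]: probability of being in q at time t and in s at time t+1
    xi = [{q: {s: A[t][q] * trans[q][s] * emit[s][obs[t + 1]] * B[t + 1][s] / total
               for s in states} for q in states} for t in range(n - 1)]
    new_init = {s: gamma[0][s] for s in states}
    new_trans = {q: {s: sum(x[q][s] for x in xi) / sum(g[q] for g in gamma[:-1])
                     for s in states} for q in states}
    new_emit = {s: {w: sum(g[s] for g, o in zip(gamma, obs) if o == w)
                       / sum(g[s] for g in gamma)
                    for w in vocab} for s in states}
    return new_init, new_trans, new_emit, total

def train(obs, states, vocab, init, trans, emit, iterations=10):
    """Repeat re-estimation a fixed number of times; record each likelihood."""
    history = []
    for _ in range(iterations):
        init, trans, emit, likelihood = reestimate(obs, states, vocab, init, trans, emit)
        history.append(likelihood)
    return init, trans, emit, history

# Hypothetical starting point and data.
STATES = ["A", "B"]
VOCAB = ["x", "y"]
INIT = {"A": 0.6, "B": 0.4}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.3, "y": 0.7}}
OBS = ["x", "x", "y", "y", "x", "y", "y", "y"]
```

This sketch stops after a fixed number of iterations; testing for convergence would mean stopping once successive values in `history` differ by less than some chosen threshold.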
Unsupervised learning
• We have an untagged corpus
• We may also have partial information such
as a set of tags, a dictionary, knowledge of
tag transitions, etc.
• Use Baum-Welch to estimate both the
context probabilities and the lexical
probabilities
Supervised learning
• Use a tagged corpus
• Count the frequencies of tag-word pairs t,w: C(t,w)
• Estimate (Maximum Likelihood Estimate):
$P(w_i \mid t_i) = \dfrac{C(t_i, w_i)}{C(t_i)}$, where $C(t) = \sum_w C(t, w)$
• Count the frequencies of tag n-grams C(t1…tn)
• Estimate (Maximum Likelihood Estimate):
$P(t_i \mid t_{i-n+1} \ldots t_{i-1}) = \dfrac{C(t_{i-n+1} \ldots t_i)}{C(t_{i-n+1} \ldots t_{i-1})}$
• What about small counts? Zero counts?
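The two counting steps can be sketched as follows for the n = 2 case. The tagged corpus and the function names are invented for illustration.

```python
from collections import Counter

# Hypothetical tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]

pair_count = Counter()      # C(t, w)
tag_count = Counter()       # C(t), the sum over w of C(t, w)
bigram_count = Counter()    # C(t_{i-1}, t_i)
history_count = Counter()   # C(t_{i-1}), counted where a next tag exists

for sentence in corpus:
    tags = [t for _, t in sentence]
    for word, tag in sentence:
        pair_count[(tag, word)] += 1
        tag_count[tag] += 1
    for prev, tag in zip(tags, tags[1:]):
        bigram_count[(prev, tag)] += 1
        history_count[prev] += 1

def lexical(word, tag):
    """MLE of P(w | t) = C(t, w) / C(t); 0 if the tag was never seen."""
    if tag_count[tag] == 0:
        return 0.0
    return pair_count[(tag, word)] / tag_count[tag]

def contextual(tag, prev):
    """MLE of P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}), i.e. n = 2."""
    if history_count[prev] == 0:
        return 0.0
    return bigram_count[(prev, tag)] / history_count[prev]
```

Any word or tag pair absent from the corpus, such as `lexical("fish", "NOUN")` here, gets probability zero under the plain maximum likelihood estimate.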
Sparse Training Data Smoothing
• Adding a bias:
$C'(t, w) = C(t, w) + \epsilon$
$P'(t \mid w) = \dfrac{C(t, w) + \epsilon}{C(w) + \epsilon T}$ (T = number of tags)
Compensates for estimation (Bayesian approach)
Has larger effect on low-count words
Solves zero-count word problem
Generalized Smoothing:
$P(t \mid w) = \lambda(w)\, \dfrac{C(t, w)}{C(w)} + (1 - \lambda(w))\, f(t, w)$, with $f$ a probability measure over t
• Reduces to bias using:
$\lambda(w) = \dfrac{C(w)}{C(w) + \epsilon T}$, $f(t, w) = \dfrac{1}{T}$
Decision-tree tagging
• Not all n-grams are created equal:
– Some n-grams contain redundant information
that may be expressed well enough with less
– Some n-grams are too sparse
• Decision Tree (Schmid, 1994)
Decision Trees
Each node is a binary test of tag $t_{i-k}$.
The leaves store probabilities for $t_i$.
All HMM algorithms can still be used
– Build tree from root to leaves
– Choose tests for nodes that
maximize information gain
– Stop when branch too sparse
– Finally, prune tree
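Choosing the test for a node by information gain can be sketched as below, restricted to tests on the immediately preceding tag (k = 1). The sample data and function names are assumptions, and this is not the actual procedure of Schmid's tagger.

```python
import math
from collections import Counter

def entropy(tags):
    """Entropy in bits of the tag distribution in a list of tags."""
    counts = Counter(tags)
    n = len(tags)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(samples, test_tag):
    """Gain of the binary test 'previous tag == test_tag'.

    samples is a list of (previous_tag, current_tag) pairs.
    """
    yes = [cur for prev, cur in samples if prev == test_tag]
    no = [cur for prev, cur in samples if prev != test_tag]
    if not yes or not no:
        return 0.0  # the test does not split the data
    n = len(samples)
    after = len(yes) / n * entropy(yes) + len(no) / n * entropy(no)
    return entropy([cur for _, cur in samples]) - after

def best_test(samples):
    """Pick the previous-tag test with the highest information gain."""
    candidates = sorted({prev for prev, _ in samples})
    return max(candidates, key=lambda t: information_gain(samples, t))

# Hypothetical (previous tag, current tag) observations.
SAMPLES = [
    ("DET", "NOUN"), ("DET", "NOUN"), ("DET", "ADJ"),
    ("NOUN", "VERB"), ("NOUN", "VERB"), ("VERB", "DET"),
]
```

Building the whole tree would repeat `best_test` on the "yes" and "no" subsets until a branch has too few samples, then prune.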
Transformation-based learning
• Eric Brill (1993)
• Start from an initial tagging, and apply a
series of transformations
• Transformations are learned as well, from
the training data
• Captures the tagging data in far fewer
parameters than stochastic models
• The transformations learned have
linguistic meaning
Transformation-based learning
• Examples: Change tag a to b when:
– The preceding (following) word is tagged z
– The word two before (after) is tagged z
– One of the 2 preceding (following) words is
tagged z
– The preceding word is tagged z and the
following word is tagged w
– The preceding (following) word is W
Transformation-based Tagger: Learning
• Start with initial tagging
• Score the possible
transformations by
comparing their result to
the “truth”.
• Choose the
transformation that
maximizes the score
• Repeat last 2 steps
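The learning loop can be sketched as below for a single transformation template, "change tag a to b when the preceding word is tagged z". The scoring by error count, the simultaneous application of a rule across the sequence, and all names are simplifying assumptions rather than Brill's exact procedure.

```python
def apply_rule(tags, rule):
    """Change tag a to b wherever the preceding word is tagged z.

    Contexts are read from the input tags, so all changes apply at once.
    """
    a, b, z = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == a and tags[i - 1] == z:
            out[i] = b
    return out

def errors(tags, truth):
    """Number of positions where the tagging disagrees with the truth."""
    return sum(t != g for t, g in zip(tags, truth))

def learn(initial, truth, max_rules=10):
    """Greedily pick the transformation that most reduces the error count."""
    tagset = sorted(set(initial) | set(truth))
    candidates = [(a, b, z) for a in tagset for b in tagset for z in tagset
                  if a != b]
    current, rules = list(initial), []
    for _ in range(max_rules):
        if not candidates:
            break
        best = min(candidates, key=lambda r: errors(apply_rule(current, r), truth))
        if errors(apply_rule(current, best), truth) >= errors(current, truth):
            break  # no transformation improves the score
        current = apply_rule(current, best)
        rules.append(best)
    return rules, current

# Hypothetical example: the second NOUN should be a VERB after TO.
INITIAL = ["DET", "NOUN", "TO", "NOUN", "DET", "NOUN"]
TRUTH = ["DET", "NOUN", "TO", "VERB", "DET", "NOUN"]
```

On this example the loop learns one rule, `("NOUN", "VERB", "TO")`, and then stops because no further transformation lowers the error count.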