Stemming, tagging and chunking


Tagging – more details
Reading:
• D Jurafsky & J H Martin (2000) Speech and Language Processing, Ch 8
• R Dale et al (2000) Handbook of Natural Language Processing, Ch 17
• C D Manning & H Schütze (1999) Foundations of Statistical Natural Language Processing, Ch 10
POS tagging - overview
• What is a “tagger”?
• Tagsets
• How to build a tagger and how a tagger
works
– Supervised vs unsupervised learning
– Rule-based vs stochastic
– And some details
What is a tagger?
• Lack of distinction between …
– Software which allows you to create something you
can then use to tag input text, e.g. “Brill’s tagger”
– The result of running such software, e.g. a tagger for
English (based on the such-and-such corpus)
• Taggers (even rule-based ones) are almost
invariably trained on a given corpus
• “Tagging” is usually understood to mean “POS
tagging”, but you can have other types of tags
(e.g. semantic tags)
Tagging vs. parsing
• Once a tagger is “trained”, the process consists
of straightforward look-up, plus local context
(and sometimes morphology)
• Will attempt to assign a tag to unknown
words, and to disambiguate homographs
• “Tagset” (list of categories) usually larger
with more distinctions
Tagset
• Parsing usually has basic word-categories,
whereas tagging makes more subtle
distinctions
• E.g. noun sg vs pl vs genitive, common vs
proper, +is, +has, … and all combinations
• Parser uses maybe 12-20 categories,
tagger may use 60-100
Simple taggers
• Default tagger has one tag per word, and
assigns it on the basis of dictionary lookup
– Tags may indicate ambiguity but not resolve it, e.g.
nvb for noun-or-verb
• Words may be assigned different tags with
associated probabilities
– Tagger will assign the most probable tag unless
there is some way to identify when a less probable
tag is in fact correct
• Tag sequences may be defined by regular
expressions, and assigned probabilities
(including 0 for illegal sequences)
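
As a concrete illustration of the look-up idea, here is a minimal sketch of a default dictionary-lookup tagger in Python; the lexicon, tag names and fallback choice are invented for illustration, not taken from any particular tagger.

```python
# Minimal sketch of a default dictionary-lookup tagger (illustrative lexicon and tags).
LEXICON = {
    "the":  {"DET": 1.00},
    "dog":  {"NOUN": 0.95, "VERB": 0.05},
    "runs": {"VERB": 0.80, "NOUN": 0.20},
}

def default_tag(word, fallback="NOUN"):
    """Assign the most probable tag from the lexicon; unknown words get a fallback tag."""
    tags = LEXICON.get(word.lower())
    if not tags:
        return fallback              # unknown word: guess the open-class default
    return max(tags, key=tags.get)   # most probable tag, ignoring context

print([(w, default_tag(w)) for w in "The dog runs".split()])
# [('The', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB')]
```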
What probabilities do we have to learn?
(a) Individual word probabilities:
Probability that a given tag t is appropriate
for a given word w
– Easy (in principle): learn from training
corpus:
P(t | w) = f(t, w) / f(w)
– Problem of “sparse data”:
• Add a small amount to each calculation, so we
get no zeros
(b) Tag sequence probability:
Probability that a given tag sequence
t1,t2,…,tn is appropriate for a given word
sequence w1,w2,…,wn
– P(t1,t2,…,tn | w1,w2,…,wn ) = ???
– Too hard to calculate for the entire sequence:
P(t1,t2,t3,t4,…) = P(t1) × P(t2|t1) × P(t3|t1,t2) × P(t4|t1,t2,t3) × …
– A subsequence is more tractable
– A sequence of 2 or 3 should be enough:
Bigram model: P(t1,t2) = P(t1) × P(t2|t1)
Trigram model: P(t1,t2,t3) = P(t1) × P(t2|t1) × P(t3|t1,t2)
N-gram model (bigram approximation): P(t1,…,tn) ≈ P(t1) × ∏i=2..n P(ti|ti−1)
More complex taggers
• Bigram taggers assign tags on the basis of
sequences of two words (usually assigning a
tag to word n on the basis of word n−1)
• An nth-order tagger assigns tags on the
basis of sequences of n words
• As the value of n increases, so does the
complexity of the statistical calculation
involved in comparing probability
combinations
History (approximate timeline)
• 1960s: Brown Corpus created (EN-US, 1 million words)
• 1970s: Brown Corpus tagged; Greene and Rubin, rule-based tagging, ~70%; LOB Corpus created (EN-UK, 1 million words)
• 1980s: LOB Corpus tagged; HMM tagging (CLAWS), 93%–95%; DeRose/Church, efficient HMM handling sparse data, 95%+
• 1990s: POS tagging separated from other NLP; Penn Treebank corpus (WSJ, 4.5M words); British National Corpus (tagged by CLAWS); transformation-based tagging (Eric Brill), rule-based, 95%+; tree-based statistics (Helmut Schmid), rule-based, 96%+; trigram tagger (Kempe), 96%+; neural networks, 96%+
• 2000s: combined methods, 98%+
How do they work?
• Tagger must be “trained”
• Many different techniques, but typically …
• Small “training corpus” hand-tagged
• Tagging rules learned automatically
• Rules define most likely sequence of tags
• Rules based on
– Internal evidence (morphology)
– External evidence (context)
Rule-based taggers
• Earliest type of tagging: two stages
• Stage 1: look up word in lexicon to give list
of potential POSs
• Stage 2: Apply rules which certify or
disallow tag sequences
• Rules originally handwritten; more recently
Machine Learning methods can be used
• cf. transformation-based learning, below
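
A minimal sketch of the two-stage idea, assuming an invented lexicon and a single handwritten constraint rule (not the rule set of any published tagger):

```python
# Two-stage rule-based tagging: lexicon look-up gives candidate tags, then a
# handwritten rule disallows impossible sequences. Lexicon and rule are invented.
LEXICON = {"the": {"DET"}, "can": {"MODAL", "NOUN", "VERB"}, "rusts": {"VERB"}}

def disallow(prev_tag, tag):
    """Handwritten constraint: a determiner is never followed by a modal verb."""
    return prev_tag == "DET" and tag == "MODAL"

def rule_based_tag(words):
    tagged, prev = [], "<s>"
    for w in words:
        candidates = LEXICON.get(w.lower(), {"NOUN"})               # stage 1: look-up
        allowed = [t for t in candidates if not disallow(prev, t)]  # stage 2: rules
        choice = sorted(allowed or candidates)[0]                   # arbitrary tie-break
        tagged.append((w, choice))
        prev = choice
    return tagged

print(rule_based_tag("The can rusts".split()))
# [('The', 'DET'), ('can', 'NOUN'), ('rusts', 'VERB')]
```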
Stochastic taggers
• Nowadays pretty much all taggers are
statistics-based, and have been since the
1980s (or even earlier: some primitive
algorithms were already published in the
60s and 70s)
• The most common approach is based on Hidden
Markov Models (also found in speech
processing, etc.)
(Hidden) Markov Models
• Probability calculations imply Markov models: we
assume that P(t|w) is dependent only on the (or, a
sequence of) previous word(s)
• (Informally) Markov models are the class of probabilistic
models that assume we can predict the future without
taking too much account of the past
• Markov chains can be modelled by finite state automata:
the next state in a Markov chain is always dependent on
some finite history of previous states
• The model is “hidden” because the underlying states
(here, the tags) are not directly observed: we only see the words they emit
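
For the sketches that follow, an HMM tagger can be represented as plain dictionaries; the states, words and probability values below are invented toy numbers, not estimates from any corpus.

```python
# Toy HMM for tagging, as dictionaries: states are tags, observations are words.
STATES = ["DET", "NOUN", "VERB"]

INIT = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}          # P(first tag)

TRANS = {                                              # P(next tag | previous tag)
    "DET":  {"DET": 0.01, "NOUN": 0.89, "VERB": 0.10},
    "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
    "VERB": {"DET": 0.50, "NOUN": 0.30, "VERB": 0.20},
}

EMIT = {                                               # P(word | tag)
    "DET":  {"the": 0.90, "dog": 0.05, "barks": 0.05},
    "NOUN": {"the": 0.01, "dog": 0.60, "barks": 0.39},
    "VERB": {"the": 0.01, "dog": 0.10, "barks": 0.89},
}
```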
Three stages of HMM training
• Estimating likelihoods on the basis of a
corpus: Forward-backward algorithm
• “Decoding”: applying the process to a
given input: Viterbi algorithm
• Learning (training): Baum-Welch algorithm
or Iterative Viterbi
Forward-backward algorithm
• Denote At(s) = P(w1 … wt, statet = s)
• Claim: At+1(s) = Σq At(q) · P(s | q) · P(wt+1 | s)
• Therefore we can calculate all At(s) in time O(L·T^n)
• Similarly, by going backwards, we can get:
Bt(s) = P(wt+1 … wL | statet = s)
• Multiplying, we get: At(s) · Bt(s) = P(w1 … wL, statet = s)
• Note that summing this over all states at a time t
gives the likelihood of w1…wL
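
A sketch of the forward recurrence above, assuming the STATES/INIT/TRANS/EMIT dictionary representation from the earlier HMM sketch:

```python
# Forward pass A_t(s), following the recurrence above.
def forward(words, states, init, trans, emit):
    """A[t][s] = P(first t+1 words, state at position t = s), positions from 0."""
    A = [{s: init[s] * emit[s].get(words[0], 0.0) for s in states}]
    for t in range(1, len(words)):
        A.append({
            s: sum(A[t - 1][q] * trans[q][s] for q in states) * emit[s].get(words[t], 0.0)
            for s in states
        })
    return A

# Likelihood of the whole sentence: sum over all states at the last position, e.g.
# A = forward("the dog barks".split(), STATES, INIT, TRANS, EMIT)
# print(sum(A[-1].values()))
```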
Viterbi algorithm
(aka Dynamic programming)
(see J&M p177ff)
• Denote Qt(s) = the best state sequence ending with state s at time t
• Claim: the first t states of Qt+1(s) form Qt(q), where q is the
state that Qt+1(s) passes through at time t
• Otherwise, appending s to the better prefix would give a
path better than Qt+1(s)
• Therefore, checking all possible states q at time t,
multiplying by the transition probability between q and
s and the emission probability of wt+1 given s, and
finding the maximum, gives Qt+1(s)
• We need to store, for each state, the previous state in
Qt(s)
• Find the maximal finish state, and reconstruct the
path
• O(L·T^n) instead of T^L
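
A corresponding sketch of Viterbi decoding with back-pointers, again assuming the dictionary HMM representation used above (the commented result assumes the toy parameters given earlier):

```python
# Viterbi decoding with back-pointers, as described above.
def viterbi(words, states, init, trans, emit):
    V = [{s: init[s] * emit[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            # best previous state q: maximise score(q) * P(s|q), then multiply by P(w_t|s)
            q_best = max(states, key=lambda q: V[t - 1][q] * trans[q][s])
            V[t][s] = V[t - 1][q_best] * trans[q_best][s] * emit[s].get(words[t], 0.0)
            back[t][s] = q_best
    # find the maximal finish state and follow the back-pointers to reconstruct the path
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# With the toy parameters sketched earlier:
# viterbi("the dog barks".split(), STATES, INIT, TRANS, EMIT)  ->  ['DET', 'NOUN', 'VERB']
```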
Baum-Welch algorithm
• Start with initial HMM
• Calculate, using F-B, the likelihood of getting
our observations given that a certain
hidden state was used at time i
• Re-estimate the HMM parameters
• Continue until convergence
• Can be shown to constantly improve
likelihood
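
A rough sketch of one such re-estimation step, here for the emission probabilities only; it reuses the forward() sketch and the dictionary HMM representation above, and the corpus format is an assumption. Transitions and initial probabilities are re-estimated analogously from the same forward–backward quantities.

```python
# One Baum-Welch re-estimation step for the emission probabilities.
def backward(words, states, trans, emit):
    """B[t][s] = P(words after position t | state at position t = s), cf. B_t(s) above."""
    B = [{s: 1.0 for s in states}]
    for t in range(len(words) - 1, 0, -1):
        B.insert(0, {
            q: sum(trans[q][s] * emit[s].get(words[t], 0.0) * B[0][s] for s in states)
            for q in states
        })
    return B

def reestimate_emissions(corpus, states, init, trans, emit):
    num = {s: {} for s in states}      # expected count of (state s emitting word w)
    den = {s: 0.0 for s in states}     # expected count of state s
    for words in corpus:
        A = forward(words, states, init, trans, emit)
        B = backward(words, states, trans, emit)
        likelihood = sum(A[-1].values())                 # P(w_1 ... w_L)
        for t, w in enumerate(words):
            for s in states:
                gamma = A[t][s] * B[t][s] / likelihood   # P(state_t = s | words)
                num[s][w] = num[s].get(w, 0.0) + gamma
                den[s] += gamma
    return {s: ({w: c / den[s] for w, c in num[s].items()} if den[s] else dict(emit[s]))
            for s in states}

# Iterate: EMIT = reestimate_emissions([["the", "dog", "barks"]], STATES, INIT, TRANS, EMIT)
# and repeat until the likelihood stops improving.
```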
Unsupervised learning
• We have an untagged corpus
• We may also have partial information such
as a set of tags, a dictionary, knowledge of
tag transitions, etc.
• Use Baum-Welch to estimate both the
context probabilities and the lexical
probabilities
Supervised learning
• Use a tagged corpus
• Count the frequencies of tag–word pairs t,w: C(t,w)
• Estimate (Maximum Likelihood Estimate):
P(wi | ti) = C(t,w) / C(t),   where C(t) = Σw C(t,w)
• Count the frequencies of tag n-grams C(t1…tn)
• Estimate (Maximum Likelihood Estimate):
P(ti | ti−n+1 … ti−1) = C(ti−n+1 … ti) / C(ti−n+1 … ti−1)
• What about small counts? Zero counts?
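
A sketch of these maximum-likelihood estimates on a toy tagged corpus (the corpus, tag names and the choice of n are illustrative):

```python
from collections import Counter

# MLE from a toy tagged corpus: P(w|t) = C(t,w)/C(t), and P(t_i | previous n-1 tags)
# from tag n-gram counts (here n = 2).
corpus = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]
n = 2

emit_count = Counter((t, w) for sent in corpus for w, t in sent)   # C(t, w)
tag_count = Counter(t for sent in corpus for _, t in sent)         # C(t)
ngram_count = Counter()                                            # C(t_{i-n+1} .. t_i)

for sent in corpus:
    tags = ["<s>"] * (n - 1) + [t for _, t in sent]
    for i in range(n - 1, len(tags)):
        ngram_count[tuple(tags[i - n + 1:i + 1])] += 1   # full n-gram
        ngram_count[tuple(tags[i - n + 1:i])] += 1       # its history (denominator)

def p_word_given_tag(w, t):
    return emit_count[(t, w)] / tag_count[t]

def p_tag_given_history(t, history):
    return ngram_count[history + (t,)] / ngram_count[history]

print(p_word_given_tag("dog", "NOUN"), p_tag_given_history("NOUN", ("DET",)))
```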
Sparse training data – smoothing
• Adding a bias: C′(t,w) = C(t,w) + λ, which gives
P′(t | w) = (C(t,w) + λ) / (C(w) + λT)   (T = number of tags)
• Compensates for estimation (Bayesian approach)
• Has a larger effect on low-count words
• Solves the zero-count word problem
• Generalized smoothing:
P(t | w) = λ1(w) · C(t,w)/C(w) + (1 − λ1(w)) · f(t,w),   with f a probability measure over t
• Reduces to the simple bias using:
λ1(w) = C(w) / (C(w) + λT),   f(t,w) = 1/T
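
The generalized formula can be checked numerically: with λ1(w) = C(w)/(C(w)+λT) and a uniform f, it gives the same value as the additive-bias estimate. The counts, bias and tagset size below are invented.

```python
# Generalized smoothing: interpolate the raw estimate C(t,w)/C(w) with a back-off
# distribution f(t,w), weighted by a word-specific lambda1(w).
def smoothed_p(t, w, C_tw, C_w, f, lambda1):
    """P(t|w) = lambda1(w) * C(t,w)/C(w) + (1 - lambda1(w)) * f(t,w)."""
    lam = lambda1(w)
    mle = C_tw(t, w) / C_w(w) if C_w(w) > 0 else 0.0
    return lam * mle + (1.0 - lam) * f(t, w)

T = 3          # number of tags in the tagset
bias = 0.5     # the small amount added to every count
counts = {("NOUN", "dog"): 4, ("VERB", "dog"): 1}
word_total = {"dog": 5}

# With lambda1(w) = C(w) / (C(w) + bias*T) and f uniform over tags (1/T),
# the generalized formula reduces to the additive-bias estimate:
p = smoothed_p("VERB", "dog",
               C_tw=lambda t, w: counts.get((t, w), 0),
               C_w=lambda w: word_total.get(w, 0),
               f=lambda t, w: 1.0 / T,
               lambda1=lambda w: word_total.get(w, 0) / (word_total.get(w, 0) + bias * T))
print(p, (counts[("VERB", "dog")] + bias) / (word_total["dog"] + bias * T))  # identical values
```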
Decision-tree tagging
• Not all n-grams are created equal:
– Some n-grams contain redundant information
that may be expressed well enough with fewer
tags
– Some n-grams are too sparse
• Decision Tree (Schmid, 1994)
Decision Trees
• Each node is a binary test on tag ti−k
• The leaves store probabilities for ti
• All HMM algorithms can still be used
• Learning:
– Build tree from root to leaves
– Choose tests for nodes that
maximize information gain
– Stop when branch too sparse
– Finally, prune tree
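
A sketch of the “choose the test that maximizes information gain” step, with invented training examples and a hypothetical context representation:

```python
import math
from collections import Counter

# Choosing a node test by information gain: among candidate binary tests on the
# previous tags, pick the one that most reduces the entropy of the current tag.
def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, test):
    """examples: list of (context, tag); test: function(context) -> bool."""
    tags = [t for _, t in examples]
    yes = [t for ctx, t in examples if test(ctx)]
    no = [t for ctx, t in examples if not test(ctx)]
    if not yes or not no:
        return 0.0
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(tags)
    return entropy(tags) - weighted

examples = [({"prev": "DET"}, "NOUN"), ({"prev": "DET"}, "NOUN"),
            ({"prev": "NOUN"}, "VERB"), ({"prev": "VERB"}, "DET")]

print(information_gain(examples, lambda ctx: ctx["prev"] == "DET"))   # 1.0 bit
```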
Transformation-based learning
• Eric Brill (1993)
• Start from an initial tagging, and apply a
series of transformations
• Transformations are learned as well, from
the training data
• Captures the tagging data in far fewer
parameters than stochastic models
• The transformations learned have
linguistic meaning
Transformation-based learning
• Examples: Change tag a to b when:
– The preceding (following) word is tagged z
– The word two before (after) is tagged z
– One of the 2 preceding (following) words is
tagged z
– The preceding word is tagged z and the
following word is tagged w
– The preceding (following) word is W
Transformation-based Tagger: Learning
• Start with initial tagging
• Score the possible
transformations by
comparing their result to
the “truth”.
• Choose the
transformation that
maximizes the score
• Repeat last 2 steps
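
A toy sketch of this learning loop, restricted to a single transformation template (“change tag a to b when the preceding tag is z”); the corpus, initial tagging and tagset are invented, and this is not Brill’s actual implementation.

```python
# Toy transformation-based learning loop with one rule template.
def apply_rule(tags, rule):
    a, b, prev = rule
    return [b if t == a and i > 0 and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

def score(tags, truth):
    return sum(t == g for t, g in zip(tags, truth))

def learn_transformations(initial_tags, truth, tagset, max_rules=10):
    tags, rules = list(initial_tags), []
    for _ in range(max_rules):
        candidates = [(a, b, z) for a in tagset for b in tagset for z in tagset if a != b]
        best = max(candidates, key=lambda r: score(apply_rule(tags, r), truth))
        if score(apply_rule(tags, best), truth) <= score(tags, truth):
            break                                  # no transformation improves the tagging
        tags = apply_rule(tags, best)
        rules.append(best)
    return rules

# The initial tagger tagged the word after "to" as NOUN; the learner recovers the fix:
initial = ["DET", "NOUN", "TO", "NOUN"]
truth   = ["DET", "NOUN", "TO", "VERB"]
print(learn_transformations(initial, truth, {"DET", "NOUN", "TO", "VERB"}))
# [('NOUN', 'VERB', 'TO')]
```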