More about tagging, assignment 2


More about tagging, assignment 2
DAC723 Language Technology
Leif Grönqvist
4 March 2003
Part-of-speech tagging (reminder)
• We want to assign the right part-of-speech
to each word in a corpus
• The tagset is determined in advance
• The word types in the corpus have various
properties in the lexicon or training data
– Some are unambiguous
– Some are ambiguous (typically 2-7 POS)
– Some are unknown
Various approaches
• Rule-based tagging
– Constraint-based tagging (SweTwol, EngTwol by Lingsoft)
– Transformation-based tagging (Eric Brill)
• Stochastic tagging (HMM)
– Using maximum likelihood estimation
– Or some bootstrap-based training (e.g. Baum-Welch)
An HMM tagger
The problem may be formulated as:

\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)

where w_1^n = w_1 ... w_n is the word sequence and t_1^n = t_1 ... t_n a tag sequence. Using Bayes' rule this may be reformulated as:

\hat{t}_1^n = \arg\max_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}

But the denominator is constant and may be removed, and we get:

\hat{t}_1^n = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
HMM tagger, cont.
The Markov assumption (for n = 3) and the chain rule give us:

P(w_1^n \mid t_1^n)\, P(t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-2}, t_{i-1})

What we need now are the lexical probabilities P(w_i \mid t_i) and the contextual probabilities P(t_i \mid t_{i-2}, t_{i-1}) – for the assignment, bigram contexts P(t_i \mid t_{i-1}) are enough.
The example: HMM

Word    Seq.1    Seq.2    Seq.3    Seq.4
you     pron     pron     pron     pron
have    verb     verb     verb     verb
to      infmrk   infmrk   infmrk   infmrk
book    noun     noun     verb     verb
a       art      art      art      art
chair   noun     verb     noun     verb
on      prep     prep     prep     prep
deck    noun     noun     noun     noun

Select the sequence with the highest probability!
Overview of assignment 2
• Parameter estimation (training)
– MLE
– Smoothing
• Implementation
– Data structures for all the probabilities
– The Viterbi algorithm
• Testing
– Create a random sample from the test data
– Calculate accuracy rate
• Report
– Description and results
– Hand it in before the end of March
Parameter estimation: MLE
• We need
– Contextual probabilities: P(t_i | t_{i-1})
– Lexical probabilities: P(w | t)
• We assume that the test data will have the same properties as the training data, so:
– P(t_i | t_{i-1}) = f(t_{i-1}, t_i) / f(t_{i-1})
– P(w | t) = f(w, t) / f(t)
• The frequencies are taken from the training data (a counting sketch follows after this slide)
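
A minimal counting sketch of these estimates in Python. It assumes the training data has already been read into a list of (word, tag) pairs; the function and variable names are my own, not part of the assignment.

    from collections import defaultdict

    def count_frequencies(tagged):
        """Collect f(t), f(t_prev, t) and f(w, t) from (word, tag) pairs."""
        f_tag = defaultdict(int)       # f(t)
        f_bigram = defaultdict(int)    # f(t_{i-1}, t_i)
        f_word_tag = defaultdict(int)  # f(w, t)
        prev = None
        for word, tag in tagged:
            f_tag[tag] += 1
            f_word_tag[(word, tag)] += 1
            if prev is not None:
                f_bigram[(prev, tag)] += 1
            prev = tag
        return f_tag, f_bigram, f_word_tag

    # MLE:  P(t_i | t_{i-1}) = f_bigram[(t_prev, t)] / f_tag[t_prev]
    #       P(w | t)         = f_word_tag[(w, t)] / f_tag[t]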
The training data
("<Ändå>" (AB "ändå"))
("<bär>" (VB PRS AKT "bära"))
("<han>" (PN UTR SIN DEF SUB "han"))
("<på>" (PP "på"))
("<en>" (DT UTR SIN IND "en"))
("<misstanke>" (NN UTR SIN IND NOM "misstanke"))
("<att>" (SN "att"))
("<chanserna>" (NN UTR PLU DEF NOM "chans"))
("<till>" (PP "till"))
("<sommarjobb>" (NN NEU SIN IND NOM "sommarjobb"))
("<hade>" (VB PRT AKT "ha"))
("<sett>" (VB SUP AKT "se"))
("<annorlunda>" (AB "annorlunda"))
("<ut>" (AB "ut"))
("<om>" (SN "om"))
("<han>" (PN UTR SIN DEF SUB "han"))
("<varit>" (VB SUP AKT "vara"))
("<ljushyad>" (JJ POS UTR SIN IND NOM "ljushyad"))
("<.>" (DL MAD "."))
Smoothing
• We need smoothing to make sure there are sequences with non-zero probability:
– P(t_i | t_{i-1}) > 0 for all tag pairs (t_{i-1}, t_i)
– For every word w there is at least one tag t such that P(w | t) > 0
• Laplace's law (additive smoothing) is good enough for this exercise
• The lexical probabilities need smoothing only for unknown words, so accept that, for example, P(misstanke | pron) = 0
• To get a better result and a faster tagger, it is also a good idea to smooth unknown words over the open classes only: AB, JJ, NN, PC, PM, RG, VB (see the sketch after this slide)
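
A sketch of both ideas, building on the counts from the MLE slide: add-k (Laplace) smoothing for the contextual probabilities, and a small constant mass for unknown words handed out only over the open classes. The names and the unknown_mass value are illustrative, not prescribed:

    OPEN_CLASSES = {"AB", "JJ", "NN", "PC", "PM", "RG", "VB"}

    def smoothed_contextual(f_bigram, f_tag, tagset, k=1.0):
        """P(t | t_prev) with add-k smoothing: non-zero for every tag pair."""
        V = len(tagset)
        def p(t, t_prev):
            return (f_bigram.get((t_prev, t), 0) + k) / (f_tag.get(t_prev, 0) + k * V)
        return p

    def lexical_prob(f_word_tag, f_tag, word, tag, unknown_mass=1e-6):
        """P(w | t): plain MLE for known words; a small constant for
        unknown words, but only over the open classes (as suggested above)."""
        if (word, tag) in f_word_tag:
            return f_word_tag[(word, tag)] / f_tag[tag]
        return unknown_mass if tag in OPEN_CLASSES else 0.0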
Implementation: data structures
• A fast implementation will need fast lookup of probabilities:
– The contextual probabilities could be put in a matrix, but this may not be necessary
– The lexical probabilities could be put in a trie if a hash table is not fast enough
• All these choices are up to you! (one possible layout is sketched after this slide)
• It may of course depend on which programming language you are using
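
One possible layout, sketched in Python: a dense matrix (list of lists) indexed by tag numbers for the contextual probabilities, and a hash table keyed by word for the lexical probabilities. The toy tagset and values are made up:

    tags = ["AB", "DT", "NN", "PN", "PP", "VB"]     # toy tagset
    idx = {t: i for i, t in enumerate(tags)}        # tag -> row/column index

    contextual = [[0.0] * len(tags) for _ in tags]  # contextual[i][j] = P(t_j | t_i)
    lexical = {}                                    # lexical[word] = {tag: P(word | tag)}

    contextual[idx["DT"]][idx["NN"]] = 0.4          # e.g. P(NN | DT)
    lexical["chanserna"] = {"NN": 0.0001}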
The Viterbi algorithm
• The trick is to keep just the probability and the best trace (highest probability) to each class at the current position
• We loop through the text from position 1 to the end, word by word
• For each class at the next position we have to check each class at the current position to find the best choice
– This results in a nested loop over the classes inside the loop over positions (see the sketch after this slide)
• Pseudocode: fig. 5.19 on page 179 in J&M
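
A compact sketch of this loop structure (not the J&M pseudocode verbatim). contextual(t, t_prev) and lexical(w, t) are assumed to be smoothed probability functions, e.g. built from the earlier sketches, and tags_for(w) returns the candidate classes for a word; all names are hypothetical:

    def viterbi(words, contextual, lexical, tags_for):
        # best[t] = probability of the best trace ending in class t at the
        # current position; trace[t] = that trace itself.
        # (The start-of-sentence contextual probability is left out for brevity.)
        best = {t: lexical(words[0], t) for t in tags_for(words[0])}
        trace = {t: [t] for t in best}
        for w in words[1:]:                  # loop over positions
            new_best, new_trace = {}, {}
            for t in tags_for(w):            # each class at the next position
                # check each class at the current position
                p, prev = max((best[s] * contextual(t, s), s) for s in best)
                new_best[t] = p * lexical(w, t)
                new_trace[t] = trace[prev] + [t]
            best, trace = new_best, new_trace
        t_end = max(best, key=best.get)      # class with the highest probability
        return trace[t_end], best[t_end]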
Viterbi, example
[Figure: a Viterbi trellis for the phrase "to book a chair" over the classes Infm, Verb, and Noun, showing the lexical and contextual probabilities at each position.]
Test corpus sampling
• One way: take random sentences from the test data until 5000 tokens are reached – no duplicated sentences!
• Another: take 10 blocks of 500 tokens each – don't cut in the middle of sentences!
• Make sure to save the test corpus so you can test on the same corpus several times (see the sketch after this slide)
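
A sketch of the first scheme: draw sentences in random order (so none is duplicated) until roughly 5000 tokens are collected. The fixed seed makes the sample reproducible, which also covers the "save the test corpus" point; the names are hypothetical:

    import random

    def sample_test_corpus(sentences, target_tokens=5000, seed=1):
        """sentences: list of token lists. Returns a non-duplicating sample."""
        rng = random.Random(seed)       # fixed seed: same corpus every run
        order = list(range(len(sentences)))
        rng.shuffle(order)
        sample, n = [], 0
        for i in order:
            if n >= target_tokens:
                break
            sample.append(sentences[i])
            n += len(sentences[i])
        return sample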
Measuring accuracy
• Overall accuracy should be measured
• The variance over, say, 10 blocks is also useful: it gives a measure of the tagger's stability
• Calculate the accuracy of a baseline tagger that uses no contextual information (a sketch of both follows after this slide)
• Report
– Description, results and analysis/discussion should be included
– Hand it in (by email) before the end of March
– Tell me where to find your tests and how to run them!
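
A sketch of the evaluation pieces: overall accuracy, the variance over 10 blocks, and a baseline that tags every word with its most frequent tag from the training counts, ignoring context. The NN default for unknown words is my assumption, not part of the assignment:

    def accuracy(gold, predicted):
        """Fraction of positions where the predicted tag matches the gold tag."""
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    def block_variance(gold, predicted, n_blocks=10):
        """Variance of the accuracy over n_blocks equal-sized blocks."""
        size = len(gold) // n_blocks
        accs = [accuracy(gold[i * size:(i + 1) * size],
                         predicted[i * size:(i + 1) * size])
                for i in range(n_blocks)]
        mean = sum(accs) / n_blocks
        return sum((a - mean) ** 2 for a in accs) / n_blocks

    def baseline_tag(word, f_word_tag, default="NN"):
        """Most frequent tag for the word, no context; default for unknowns.
        (Linear scan kept simple for the sketch; precompute a word -> tag
        table in practice.)"""
        candidates = {t: f for (w, t), f in f_word_tag.items() if w == word}
        return max(candidates, key=candidates.get) if candidates else default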