Basic statistics and n-grams


PART-OF-SPEECH TAGGING
Università di Venezia
1 October 2003
September 2003
This lecture

- Tagsets
- Rule-based tagging
- Brill tagger
- Tagging with Markov models
- The Viterbi algorithm
POS tagging: the problem

- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- Problem: assign a tag to race
- Requires: tagged corpus
Ambiguity in POS tagging

The     AT
man     NN, VB
still   NN, VB, RB
saw     NN, VBD
her     PPO, PP$
How hard is POS tagging?

In the Brown corpus,
- 11.5% of word types are ambiguous
- 40% of word TOKENS are ambiguous

Number of tags:        1      2     3    4   5   6  7
Number of word types:  35340  3760  264  61  12  2  1
Why is POS tagging useful?

- Makes searching for patterns of interest to linguists in a corpus much easier (the original motivation!)
- Useful as a basis for parsing
- For applications such as IR, provides some degree of meaning distinction
- In ASR, helps selection of the next word
Choosing a tagset

- The choice of tagset greatly affects the difficulty of the problem
- Need to strike a balance between:
  - getting better information about context (best: introduce more distinctions)
  - making it possible for classifiers to do their job (need to minimize distinctions)
Some of the best-known tagsets

- Brown corpus: 87 tags
- Penn Treebank: 45 tags
- Lancaster UCREL C5 (used to tag the BNC): 61 tags
- Lancaster C7: 145 tags
Important Penn Treebank tags

Verb inflection tags

The entire Penn Treebank tagset

UCREL C5
Tagsets for Italian

- PAROLE
- Si-TAL (Pisa, Venezia, IRST, ....)
- ???
The SI-TAL tagset
POS tags in the Brown corpus

Television/NN has/HVZ yet/RB to/TO work/VB out/RP a/AT living/VBG arrangement/NN with/IN jazz/NN ,/, which/WDT comes/VBZ to/IN the/AT medium/NN more/QL as/CS an/AT uneasy/JJ guest/NN than/CS as/CS a/AT relaxed/VBN member/NN of/IN the/AT family/NN ./.
Exercises

- Abbonati al minimo ma la squadra piace
- Si sta bene in B …
Tagging methods

- Hand-coded
- Brill tagger
- Statistical (Markov) taggers
Hand-coded POS tagging: the two-stage architecture

- Early POS taggers were all hand-coded
- Most of these (Harris, 1962; Greene and Rubin, 1971), and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
Hand-coded rules (ENGTWOL)

STEP 1: assign to each word a list of potential parts of speech
- in ENGTWOL, this is done by a two-level morphological analyzer (a finite-state transducer)

STEP 2: use about 1000 hand-coded CONSTRAINTS (if-then rules) to choose a tag using contextual information
- the constraints act as FILTERS
Example

Pavlov had shown that salivation ….

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVOO
shown       SHOW PCP2 SVOO SVO SG
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
A constraint

ADVERBIAL-THAT RULE
Given input: “that”
if
  (+1 A/ADV/QUANT);  /* next word is adj, adv, or quantifier */
  (+2 SENT-LIM);     /* and following that there is a sentence boundary */
  (NOT –1 SVOC/A);   /* and the previous word is not a verb like `consider' */
then eliminate non-ADV tags
else eliminate ADV tag.
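Read operationally, the constraint is a filter over each word's candidate-tag list. A minimal sketch in Python (the set-of-tags representation and the simplified tag names are my own, not ENGTWOL's):

```python
# Each token carries a set of candidate tags; a constraint filters them.
# Toy rendering of the ADVERBIAL-THAT rule; tag names are simplified.

def adverbial_that_rule(tags, i):
    """Apply the ADVERBIAL-THAT constraint to token i (the word 'that').

    tags: list of sets of candidate tags, one set per token.
    A tag set containing 'SENT-LIM' marks a sentence-boundary pseudo-token.
    """
    next_ok = i + 1 < len(tags) and tags[i + 1] & {'ADJ', 'ADV', 'QUANT'}
    after_ok = i + 2 < len(tags) and 'SENT-LIM' in tags[i + 2]
    prev_ok = i == 0 or 'SVOC/A' not in tags[i - 1]
    if next_ok and after_ok and prev_ok:
        tags[i] = tags[i] & {'ADV'}   # eliminate non-ADV tags
    else:
        tags[i] = tags[i] - {'ADV'}   # eliminate the ADV tag
    return tags

# "I know that much." -> 'that' before an adj/quantifier at sentence end
sent = [{'PRON'}, {'V'}, {'ADV', 'DET', 'CS'}, {'QUANT', 'ADJ'}, {'SENT-LIM'}]
adverbial_that_rule(sent, 2)
print(sent[2])  # only the ADV reading survives
```

The point of the sketch is that each constraint only removes candidates; the morphological analyzer of STEP 1 supplies the initial tag sets.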
Tagging with lexical frequencies

- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- Problem: assign a tag to race given its lexical frequency
- Solution: we choose the tag with the greater probability:
  - P(race|VB)
  - P(race|NN)
- Actual estimates from the Switchboard corpus:
  - P(race|NN) = .00041
  - P(race|VB) = .00003
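The frequency-only choice can be sketched in a few lines of Python, using the Switchboard estimates from the slide (the dictionary representation is just for illustration):

```python
# Choosing a tag for 'race' from lexical frequency alone.
# P(word|tag) estimates are the ones quoted on the slide (Switchboard).

p_word_given_tag = {
    ('race', 'NN'): 0.00041,
    ('race', 'VB'): 0.00003,
}

def most_frequent_tag(word, candidate_tags):
    # pick the tag t maximizing P(word|t); unseen pairs count as 0
    return max(candidate_tags, key=lambda t: p_word_given_tag.get((word, t), 0.0))

print(most_frequent_tag('race', ['NN', 'VB']))  # NN: .00041 > .00003
```

This always tags race as NN, which is wrong in "to race tomorrow"; that is exactly the problem context information is meant to fix.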
Factors that play a role in POS tagging

- Both the Brill tagger and HMM-based taggers achieve good results by combining:
  - FREQUENCY
    - I poured FLOUR/NN into the bowl.
    - Peter should FLOUR/VB the baking tray.
  - information about CONTEXT
    - I saw the new/JJ PLAY/NN in the theater.
    - The boy will/MD PLAY/VBP in the garden.
The Brill tagger

- An example of TRANSFORMATION-BASED LEARNING
- Very popular (freely available, works fairly well)
- A SUPERVISED method: requires a tagged corpus
- Basic idea: do a quick job first (using frequency), then revise it using contextual rules
An example

- Examples:
  - It is expected to race tomorrow.
  - The race for outer space.
- Tagging algorithm:
  1. Tag all uses of “race” as NN (the most likely tag in the Brown corpus):
     - It is expected to race/NN tomorrow
     - the race/NN for outer space
  2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:
     - It is expected to race/VB tomorrow
     - the race/NN for outer space
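The two steps above can be sketched as follows, with a hypothetical minimal representation of a tagged sentence as (word, tag) pairs:

```python
# Step 1 tags every 'race' as NN; step 2 rewrites NN -> VB after TO.

def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    """Replace from_tag with to_tag wherever the previous tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

s1 = [('expected', 'VBN'), ('to', 'TO'), ('race', 'NN'), ('tomorrow', 'NN')]
s2 = [('the', 'DT'), ('race', 'NN'), ('for', 'IN'), ('outer', 'JJ')]

print(apply_transformation(s1, 'NN', 'VB', 'TO'))  # race becomes VB
print(apply_transformation(s2, 'NN', 'VB', 'TO'))  # race stays NN
```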
Transformation-based learning in the Brill tagger

1. Tag the corpus with the most likely tag for each word
2. Choose a TRANSFORMATION that deterministically replaces an existing tag with a new one such that the resulting tagged corpus has the lowest error rate
3. Apply that transformation to the training corpus
4. Repeat
5. Return a tagger that
   a. first tags using unigrams
   b. then applies the learned transformations in order
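The learning loop can be sketched as a greedy search over candidate transformations. This toy version uses a single template, "change tag X to Y when the previous tag is Z", and invented data; the real Brill tagger draws candidates from a richer template set:

```python
# Greedy transformation-based learning over tag sequences (toy version).

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def apply_rule(tags, rule):
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def learn(tags, gold, tagset, max_rules=10):
    learned = []
    while len(learned) < max_rules:
        # all instantiations of the "X -> Y after Z" template
        candidates = [(f, t, p) for f in tagset for t in tagset
                      for p in tagset if f != t]
        best = min(candidates, key=lambda r: errors(apply_rule(tags, r), gold))
        if errors(apply_rule(tags, best), gold) >= errors(tags, gold):
            break  # no transformation improves the corpus any further
        tags = apply_rule(tags, best)
        learned.append(best)
    return learned

# unigram-tagged corpus vs. gold tags: 'race' mis-tagged NN after TO
initial = ['VBN', 'TO', 'NN', 'NN', 'DT', 'NN', 'IN']
gold    = ['VBN', 'TO', 'VB', 'NN', 'DT', 'NN', 'IN']
print(learn(initial, gold, {'VBN', 'TO', 'NN', 'VB', 'DT', 'IN'}))
# -> [('NN', 'VB', 'TO')]
```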
The algorithm

Examples of learned transformations

Templates

An example
Markov Model POS tagging

- Again, the problem is to find an `explanation' with the highest probability:

  argmax over t1..tn ∈ T of P(t1..tn | w1..wn)

- As in yesterday's case, this can be `turned around' using Bayes' Rule:

  argmax over t1..tn of P(w1..wn | t1..tn) P(t1..tn) / P(w1..wn)
Combining frequency and contextual information

- As in the case of spelling, this equation can be simplified, since the denominator P(w1..wn) is the same for every tag sequence:

  argmax over t1..tn of P(w1..wn | t1..tn) · P(t1..tn)
                        [likelihood]         [prior]

- As we will see, once further simplifications are applied, this equation will encode both FREQUENCY and CONTEXT INFORMATION
Three further assumptions

- MARKOV assumption: a tag only depends on a FIXED NUMBER of previous tags (here, assume bigrams)
  - this simplifies the second factor: P(t1..tn) ≈ Π_i P(ti | ti-1)
- INDEPENDENCE assumption: words are independent of each other
- A word's identity only depends on its own tag
  - these simplify the first factor: P(w1..wn | t1..tn) ≈ Π_i P(wi | ti)
The final equations

  argmax over t1..tn of Π_i P(wi | ti) · P(ti | ti-1)

where P(wi | ti) encodes FREQUENCY and P(ti | ti-1) encodes CONTEXT
Estimating the probabilities

Can be done using Maximum Likelihood Estimation as usual, for BOTH probabilities:

- P(ti | ti-1) = C(ti-1, ti) / C(ti-1)
- P(wi | ti) = C(wi, ti) / C(ti)
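With the usual count-based estimators, P(ti|ti-1) = C(ti-1,ti)/C(ti-1) and P(wi|ti) = C(wi,ti)/C(ti), the estimation can be sketched as follows (the tiny tagged corpus is invented for illustration):

```python
from collections import Counter

# MLE estimates of transition and emission probabilities from counts.
corpus = [('the', 'DT'), ('race', 'NN'), ('to', 'TO'), ('race', 'VB'),
          ('the', 'DT'), ('reason', 'NN')]

tag_count = Counter(t for _, t in corpus)
word_tag = Counter(corpus)
bigram = Counter((corpus[i - 1][1], corpus[i][1]) for i in range(1, len(corpus)))

def p_word_given_tag(w, t):
    # C(w, t) / C(t)
    return word_tag[(w, t)] / tag_count[t]

def p_tag_given_prev(t, prev):
    # C(prev, t) / C(prev)
    return bigram[(prev, t)] / tag_count[prev]

print(p_word_given_tag('race', 'NN'))  # C(race,NN)/C(NN) = 1/2
print(p_tag_given_prev('VB', 'TO'))    # C(TO,VB)/C(TO)   = 1/1
```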
An example of tagging with Markov Models

- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
- People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- Problem: assign a tag to race given the subsequences
  - to/TO race/???
  - the/DT race/???
- Solution: we choose the tag that has the greater of these probabilities:
  - P(VB|TO) P(race|VB)
  - P(NN|TO) P(race|NN)
Tagging with MMs (2)

Actual estimates from the Switchboard corpus:
- LEXICAL FREQUENCIES:
  - P(race|NN) = .00041
  - P(race|VB) = .00003
- CONTEXT:
  - P(NN|TO) = .021
  - P(VB|TO) = .34
- The probabilities:
  - P(VB|TO) P(race|VB) = .00001
  - P(NN|TO) P(race|NN) = .000007
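The comparison can be checked directly in Python (note that .021 × .00041 actually comes to about .0000086; the slide reports .000007, but either way VB wins):

```python
# Reproducing the to/race comparison with the Switchboard estimates.
p_race_nn, p_race_vb = 0.00041, 0.00003   # lexical frequencies P(race|tag)
p_nn_to, p_vb_to = 0.021, 0.34            # context: P(tag|TO)

score_vb = p_vb_to * p_race_vb   # ≈ .00001
score_nn = p_nn_to * p_race_nn   # ≈ .0000086

print('VB' if score_vb > score_nn else 'NN')  # VB: the context term dominates
```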
A graphical interpretation of the POS tagging equations

Hidden Markov Models

An example
Computing the most likely sequence of tags

- In general, the problem of computing the most likely sequence t1 .. tn could have exponential complexity
- It can however be solved in polynomial time using an instance of DYNAMIC PROGRAMMING: the VITERBI ALGORITHM (Viterbi, 1967)
- (Also called TRELLIS ALGORITHMS)
Trellis algorithms

The Viterbi algorithm

Viterbi (pseudo-code format)

Viterbi: an example
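The Viterbi pseudo-code itself is not reproduced in the transcript; the following is a minimal bigram-HMM sketch, with toy transition and emission probabilities invented for the to/race example:

```python
# Bigram Viterbi: fill a trellis column per word, keep backpointers.

def viterbi(words, tags, p_trans, p_emit, start='<s>'):
    """Most likely tag sequence under P(t_i|t_{i-1}) and P(w_i|t_i)."""
    # trellis[i][t] = (best prob. of any path ending in tag t, backpointer)
    trellis = [{t: (p_trans.get((start, t), 0) * p_emit.get((words[0], t), 0),
                    None) for t in tags}]
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            best_prev = max(tags, key=lambda p:
                            trellis[i - 1][p][0] * p_trans.get((p, t), 0))
            score = (trellis[i - 1][best_prev][0]
                     * p_trans.get((best_prev, t), 0)
                     * p_emit.get((words[i], t), 0))
            col[t] = (score, best_prev)
        trellis.append(col)
    # follow backpointers from the best final state
    last = max(tags, key=lambda t: trellis[-1][t][0])
    path = [last]
    for col in reversed(trellis[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path))

tags = {'TO', 'VB', 'NN'}
p_trans = {('<s>', 'TO'): 0.1, ('TO', 'VB'): 0.34, ('TO', 'NN'): 0.021}
p_emit = {('to', 'TO'): 0.5, ('race', 'VB'): 0.00003, ('race', 'NN'): 0.00041}
print(viterbi(['to', 'race'], tags, p_trans, p_emit))  # ['TO', 'VB']
```

Each trellis column is computed from the previous one alone, which is why the cost is polynomial (O(n·|T|²)) rather than exponential in the sentence length.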
Markov chains and Hidden Markov Models

- Markov chain: only transition probabilities; each node is associated with a single OUTPUT
- Hidden Markov Models: nodes may have more than one output; probability P(w|t) of outputting word w from state t
Training HMMs

- The reason why HMMs are so popular is that they come with a LEARNING ALGORITHM: the FORWARD-BACKWARD algorithm (an instance of the class of algorithms called EM algorithms)
- Basic idea of the forward-backward algorithm: start by assigning random transition and emission probabilities, then iterate
Evaluation of POS taggers

- Can reach up to 96.7% correct on the Penn Treebank (see Brants, 2000)
- (But see next lecture)
Additional issues

- Most of the difference in performance between POS algorithms depends on their treatment of UNKNOWN WORDS
- Multiple-token words (`Penn Treebank')
- Class-based N-grams
Other techniques

- There is a move away from HMMs for this task and towards techniques that make it easier to use multiple features
- MAXIMUM ENTROPY taggers are among the highest performing at the moment
Freely available POS taggers

Quite a few taggers are freely available:
- Brill (TBL)
- QTAG (HMM; can be trained for other languages)
- LT POS (part of the Edinburgh LTG suite of tools)
- See Chris Manning's Statistical NLP resources web page (from the course web page)
POS tagging for Italian

- Xerox Grenoble
- IMMORTALE (Università di Venezia)
- Pi-Tagger (Università di Pisa)
Other kinds of tagging

- Sense tagging (SEMCOR, SENSEVAL)
- Syntactic tagging (`supertagging')
- Dialogue act tagging
- Semantic tagging (animacy, etc.)
Readings

- Jurafsky and Martin, chapter 8