Basic statistics and n-grams

PART-OF-SPEECH TAGGING
Topics of the next three lectures

- Tagsets
- Rule-based tagging
- Brill tagger
- Tagging with Markov models
- The Viterbi algorithm

POS tagging: the problem

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Problem: assign a tag to race
Requires: tagged corpus
Why is POS tagging useful?

- Makes searching for patterns of interest to linguists in a corpus much easier (the original motivation!)
- Useful as a basis for parsing
- For applications such as IR, provides some degree of meaning distinction
- In ASR, helps selection of the next word
Ambiguity in POS tagging

Word:   The   man   still   saw   her
Tags:   AT    NN    NN      NN    PPO
              VB    VB      VBD   PP$
                    RB
How hard is POS tagging?

In the Brown corpus:
- 11.5% of word TYPES are ambiguous
- 40% of word TOKENS are ambiguous

Number of tags    Number of word types
1                 35340
2                 3760
3                 264
4                 61
5                 12
6                 2
7                 1
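
As a rough sketch, counts like these can be recomputed with NLTK's copy of the Brown corpus (this assumes nltk is installed and the corpus downloaded; exact figures depend on the tagset and corpus version, so they will not match the slide precisely):

    # Sketch: tag-ambiguity counts over Brown word types, using NLTK
    # (assumes `pip install nltk` and nltk.download('brown')).
    from collections import Counter, defaultdict
    from nltk.corpus import brown

    tags_per_type = defaultdict(set)
    for word, tag in brown.tagged_words():
        tags_per_type[word.lower()].add(tag)

    n_types = len(tags_per_type)
    n_ambiguous = sum(1 for tags in tags_per_type.values() if len(tags) > 1)
    print(f"ambiguous word types: {n_ambiguous / n_types:.1%}")

    # Distribution of types by number of distinct tags, as in the table above
    for n_tags, n_words in sorted(Counter(map(len, tags_per_type.values())).items()):
        print(n_tags, n_words)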
Frequency + Context

Both the Brill tagger and HMM-based taggers achieve good results by combining:

- FREQUENCY
  I poured FLOUR/NN into the bowl.
  Peter should FLOUR/VB the baking tray.
- Information about CONTEXT
  I saw the new/JJ PLAY/NN in the theater.
  The boy will/MD PLAY/VB in the garden.
The importance of context

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Choosing a tagset

The choice of tagset greatly affects the difficulty of the problem. Need to strike a balance between:

- getting better information about context (best: introduce more distinctions)
- making it possible for classifiers to do their job (need to minimize distinctions)
Some of the best-known tagsets

- Brown corpus: 87 tags
- Penn Treebank: 45 tags
- Lancaster UCREL C5 (used to tag the BNC): 61 tags
- Lancaster C7: 145 tags
Important Penn Treebank tags
(table shown on slide)

Verb inflection tags
(table shown on slide)

The entire Penn Treebank tagset
(table shown on slide)

UCREL C5
(table shown on slide)

The SI-TAL tagset
(table shown on slide)
POS tags in the Brown corpus

Television/NN has/HVZ yet/RB to/TO work/VB out/RP a/AT living/VBG arrangement/NN with/IN jazz/NN ,/, which/WDT comes/VBZ to/IN the/AT medium/NN more/QL as/CS an/AT uneasy/JJ guest/NN than/CS as/CS a/AT relaxed/VBN member/NN of/IN the/AT family/NN ./.
SGML-based POS in the BNC

<div1 complete=y org=seq>
<head>
<s n=00040> <w NN2>TROUSERS <w VVB>SUIT
</head>
<caption>
<s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing <w AJ0>masculine <w PRP>about <w DT0>these <w AJ0>new <w NN1>trouser <w NN2-VVZ>suits <w PRP>in <w NN1>summer<w POS>'s <w AJ0>soft <w NN2>pastels<c PUN>.
<s n=00042> <w NP0>Smart <w CJC>and <w AJ0>acceptable <w PRP>for <w NN1>city <w NN1-VVB>wear <w CJC>but <w AJ0>soft <w AV0>enough <w PRP>for <w AJ0>relaxed <w NN2>days
</caption>
Quick test

DoCoMo and Sony are to develop a chip that would let people pay for goods through their mobiles.
Tagging methods

- Hand-coded
- Brill tagger
- Statistical (Markov) taggers
Hand-coded POS tagging: the two-stage architecture

- Early POS taggers were all hand-coded
- Most of these (Harris, 1962; Greene and Rubin, 1971), and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
Hand-coded rules (ENGTWOL)

STEP 1: assign to each word a list of potential parts of speech
- in ENGTWOL, this is done by a two-level morphological analyzer (a finite-state transducer)

STEP 2: use about 1,000 hand-coded CONSTRAINTS (if-then rules) to choose a tag using contextual information
- the constraints act as FILTERS
Example

Pavlov had shown that salivation ...

Pavlov        PAVLOV N NOM SG PROPER
had           HAVE V PAST VFIN SVO
              HAVE PCP2 SVOO
shown         SHOW PCP2 SVOO SVO SG
that          ADV
              PRON DEM SG
              DET CENTRAL DEM SG
              CS
salivation    N NOM SG
A constraint

ADVERBIAL-THAT RULE
Given input: "that"
if
  (+1 A/ADV/QUANT);   /* next word is adj, adv, or quantifier */
  (+2 SENT-LIM);      /* and following that there is a sentence boundary */
  (NOT -1 SVOC/A);    /* and the previous word is not a verb like `consider' */
then eliminate non-ADV tags
else eliminate ADV tag.
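
To make the filter idea concrete, here is a hypothetical Python rendering of the rule above; the data layout and the helper predicate are illustrative stand-ins, not ENGTWOL's actual constraint formalism:

    # Hypothetical Python rendering of the ADVERBIAL-THAT constraint;
    # ENGTWOL's real rules are written in its own constraint language.
    def adverbial_that_filter(position, sentence):
        """Filter the candidate tags of "that" at `position`.
        `sentence` is a list of (word, set_of_candidate_tags) pairs."""
        def has_tag(i, tags):
            return 0 <= i < len(sentence) and bool(sentence[i][1] & tags)

        candidates = sentence[position][1]
        if (has_tag(position + 1, {"A", "ADV", "QUANT"})   # next word adj/adv/quant
                and position + 2 >= len(sentence)          # crude SENT-LIM check:
                                                           # end of list = boundary
                and not has_tag(position - 1, {"SVOC/A"})):  # prev not `consider'-type
            return {t for t in candidates if t == "ADV"}   # eliminate non-ADV tags
        return {t for t in candidates if t != "ADV"}       # eliminate ADV tag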
Tagging with lexical frequencies

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Problem: assign a tag to race given its lexical frequency
Solution: we choose the tag that has the greater of
- P(race|VB)
- P(race|NN)

Actual estimates from the Switchboard corpus:
- P(race|NN) = .00041
- P(race|VB) = .00003
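
Such lexical likelihoods are simple MLE relative frequencies; a minimal sketch, assuming a hypothetical `tagged_corpus` iterable of (word, tag) pairs:

    # Sketch: MLE estimate of P(word | tag) = C(word, tag) / C(tag),
    # assuming `tagged_corpus` yields (word, tag) pairs (hypothetical input).
    from collections import Counter

    def lexical_likelihood(tagged_corpus):
        word_tag, tag = Counter(), Counter()
        for w, t in tagged_corpus:
            word_tag[(w.lower(), t)] += 1
            tag[t] += 1
        return {(w, t): c / tag[t] for (w, t), c in word_tag.items()}

    # e.g. compare p[("race", "NN")] against p[("race", "VB")] to choose the tag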
The Brill tagger

- An example of TRANSFORMATION-BASED LEARNING
- Very popular (freely available, works fairly well)
- A SUPERVISED method: requires a tagged corpus
- Basic idea: do a quick job first (using frequency), then revise it using contextual rules
An example

Examples:
- It is expected to race tomorrow.
- The race for outer space.

Tagging algorithm:
1. Tag all uses of "race" as NN (the most likely tag in the Brown corpus):
   - It is expected to race/NN tomorrow
   - the race/NN for outer space
2. Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO:
   - It is expected to race/VB tomorrow
   - the race/NN for outer space
Transformation-based learning in the Brill tagger

1. Tag the corpus with the most likely tag for each word
2. Choose a TRANSFORMATION that deterministically replaces an existing tag with a new one such that the resulting tagged corpus has the lowest error rate
3. Apply that transformation to the training corpus
4. Repeat
5. Return a tagger that
   a. first tags using unigrams
   b. then applies the learned transformations in order
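
A minimal sketch of this loop for a single rule template, "change tag FRM to TO when the previous tag is PREV"; the (word, tag)-pair data format and the one-template restriction are assumptions made for illustration, not Brill's actual implementation:

    # Minimal transformation-based learning sketch, one template only.
    # `gold` and `baseline` are lists of (word, tag) pairs over the same
    # words; `tagset` is a set of tags. All hypothetical inputs.
    from collections import Counter

    def apply_rule(rule, tagged):
        frm, to, prev = rule
        # contexts are read from the pre-pass tagging, so the pass is deterministic
        return [(w, to) if i > 0 and t == frm and tagged[i - 1][1] == prev
                else (w, t)
                for i, (w, t) in enumerate(tagged)]

    def score_rules(gold, current, tagset):
        """Net score per candidate rule: errors fixed minus correct tags broken."""
        scores = Counter()
        for i in range(1, len(gold)):
            prev, cur, right = current[i - 1][1], current[i][1], gold[i][1]
            if cur != right:
                scores[(cur, right, prev)] += 1        # firing here fixes an error
            else:
                for wrong in tagset - {cur}:           # firing here breaks a correct tag
                    scores[(cur, wrong, prev)] -= 1
        return scores

    def learn_transformations(gold, baseline, tagset, max_rules=10):
        current, rules = list(baseline), []
        for _ in range(max_rules):
            scores = score_rules(gold, current, tagset)
            if not scores:
                break
            rule, gain = max(scores.items(), key=lambda kv: kv[1])
            if gain <= 0:
                break
            rules.append(rule)
            current = apply_rule(rule, current)
        return rules  # at tagging time: unigram-tag first, then apply these in order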
The algorithm
(shown on slide)

Examples of learned transformations
(shown on slide)

Templates
(shown on slide)

An example
(shown on slide)
Markov Model POS tagging

Again, the problem is to find an `explanation' with the highest probability:

    \arg\max_{t_1 \ldots t_n \in T} P(t_1 \ldots t_n \mid w_1 \ldots w_n)

As in yesterday's case, this can be `turned around' using Bayes' Rule:

    \arg\max_{t_1 \ldots t_n} \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n) \, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)}
Combining frequency and contextual information

As in the case of spelling, this equation can be simplified: the denominator P(w_1 ... w_n) is the same for every candidate tag sequence, so it can be dropped, leaving a likelihood term and a prior term:

    \arg\max_{t_1 \ldots t_n} \underbrace{P(w_1 \ldots w_n \mid t_1 \ldots t_n)}_{\text{likelihood}} \; \underbrace{P(t_1 \ldots t_n)}_{\text{prior}}

As we will see, once further simplifications are applied, this equation will encode both FREQUENCY and CONTEXT INFORMATION.
Three further assumptions

- MARKOV assumption: a tag only depends on a FIXED NUMBER of previous tags (here, assume bigrams). This simplifies the second factor (the prior):

    P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

- INDEPENDENCE assumption: words are independent from each other
- A word's identity only depends on its own tag. Together these simplify the first factor (the likelihood):

    P(w_1 \ldots w_n \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
The final equations

    \hat{t}_1 \ldots \hat{t}_n = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} \underbrace{P(w_i \mid t_i)}_{\text{FREQUENCY}} \, \underbrace{P(t_i \mid t_{i-1})}_{\text{CONTEXT}}
Estimating the probabilities

Can be done using Maximum Likelihood Estimation as usual, for BOTH probabilities:

    P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})} \qquad P(w_i \mid t_i) = \frac{C(w_i, t_i)}{C(t_i)}
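
A sketch of the transition estimate in Python, assuming a hypothetical corpus of tagged sentences (lists of (word, tag) pairs) and a "<s>" pseudo-tag for sentence starts; P(w_i | t_i) is estimated exactly as in the lexical-frequency sketch above:

    # Sketch: MLE estimate of P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1});
    # the "<s>" start pseudo-tag is an assumption of this sketch.
    from collections import Counter

    def transition_probs(tagged_sentences):
        bigram, unigram = Counter(), Counter()
        for sent in tagged_sentences:
            prev = "<s>"
            for _, t in sent:
                bigram[(prev, t)] += 1
                unigram[prev] += 1
                prev = t
        return {(p, t): c / unigram[p] for (p, t), c in bigram.items()}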
An example of tagging with Markov Models

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Problem: assign a tag to race given the subsequences
- to/TO race/???
- the/DT race/???

Solution: we choose the tag that has the greater of these probabilities:
- P(VB|TO) P(race|VB)
- P(NN|TO) P(race|NN)
Tagging with MMs (2)

Actual estimates from the Switchboard corpus:

LEXICAL FREQUENCIES:
- P(race|NN) = .00041
- P(race|VB) = .00003

CONTEXT:
- P(NN|TO) = .021
- P(VB|TO) = .34

The probabilities:
- P(VB|TO) P(race|VB) = .00001
- P(NN|TO) P(race|NN) = .000007

So in the to/TO context, race is tagged VB.
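
A quick check of that comparison with the slide's numbers:

    # Quick check of the comparison above, using the slide's estimates
    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)  ~= 1e-05
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)  ~= 8.6e-06
    # (the slide reports .000007, presumably computed from unrounded estimates)
    print("VB" if p_vb > p_nn else "NN")   # -> VB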
A graphical interpretation of the POS tagging equations
(shown on slide)

Hidden Markov Models
(shown on slide)

An example
(shown on slide)
Computing the most likely sequence of tags

- In general, the problem of computing the most likely sequence t_1 .. t_n could have exponential complexity
- It can however be solved in polynomial time using an instance of DYNAMIC PROGRAMMING: the VITERBI ALGORITHM (Viterbi, 1967)
- Such algorithms are also called TRELLIS ALGORITHMS
Trellis algorithms
(shown on slide)

The Viterbi algorithm
(shown on slide)

Viterbi (pseudo-code format)
(shown on slide; a Python sketch follows)

Viterbi: an example
(shown on slide)
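
Since the pseudo-code slide is not reproduced in this transcript, here is a standard Viterbi sketch for bigram HMM tagging; the dict-based `trans`/`emit` tables (as estimated in the sketches above), the "<s>" start tag, and the absence of smoothing are assumptions of this sketch:

    # Standard Viterbi sketch for a bigram HMM tagger. `trans` and `emit`
    # are dicts keyed by (prev_tag, tag) and (word, tag), e.g. the MLE
    # tables sketched earlier. No smoothing: an unseen word breaks this
    # sketch (cf. the unknown-word point under "Additional issues" below).
    def viterbi(words, tagset, trans, emit):
        best = {"<s>": 1.0}   # probability of the best path ending in each tag
        backptrs = []         # backptrs[i][t] = best previous tag for t at word i
        for w in words:
            new_best, back = {}, {}
            for t in tagset:
                e = emit.get((w, t), 0.0)
                if e == 0.0:
                    continue   # tag t cannot emit word w
                prob, prev = max(
                    (p * trans.get((pt, t), 0.0) * e, pt) for pt, p in best.items()
                )
                if prob > 0.0:
                    new_best[t], back[t] = prob, prev
            best = new_best
            backptrs.append(back)
        # follow the back-pointers from the best final tag
        tags = [max(best, key=best.get)]
        for i in range(len(words) - 1, 0, -1):
            tags.append(backptrs[i][tags[-1]])
        return tags[::-1]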
Markov chains and Hidden Markov Models

- Markov chain: only transition probabilities; each node is associated with a single OUTPUT
- Hidden Markov Models: nodes may have more than one output, with probability P(w|t) of outputting word w from state t
Training HMMs

The reason HMMs are so popular is that they come with a LEARNING ALGORITHM: the FORWARD-BACKWARD algorithm (an instance of the class of algorithms called EM algorithms).

Basic idea of the forward-backward algorithm: start by assigning random transition and emission probabilities, then iterate.
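
A minimal sketch of one forward-backward (EM) iteration for a discrete HMM, using numpy; the array-based parameterization (pi, A, B) is an assumption of this sketch, and a real implementation would rescale or work in log space and iterate to convergence:

    # One EM iteration of forward-backward (Baum-Welch) for a discrete HMM.
    # obs: sequence of symbol indices; pi: (N,) initial probabilities;
    # A: (N, N) transitions; B: (N, M) emissions. Shapes are assumptions;
    # no rescaling, so long sequences will underflow.
    import numpy as np

    def forward_backward_step(obs, pi, A, B):
        N, T = len(pi), len(obs)

        # forward pass: alpha[t, i] = P(o_1..o_t, state_t = i)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

        # backward pass: beta[t, i] = P(o_{t+1}..o_T | state_t = i)
        beta = np.zeros((T, N))
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

        # expected state occupancies and transitions (E step)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :])
        xi /= xi.sum(axis=(1, 2), keepdims=True)

        # re-estimate the parameters (M step)
        new_pi = gamma[0]
        new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.zeros_like(B)
        for t in range(T):
            new_B[:, obs[t]] += gamma[t]
        new_B /= gamma.sum(axis=0)[:, None]
        return new_pi, new_A, new_B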
Evaluation of POS taggers

Can reach up to 96.7% correct on the Penn Treebank (see Brants, 2000). (But see next lecture.)
Additional issues

- Most of the difference in performance between POS algorithms depends on their treatment of UNKNOWN WORDS
- Multiple-token words (`Penn Treebank')
- Class-based N-grams
Other techniques

There is a move away from HMMs for this task and towards techniques that make it easier to use multiple features. MAXIMUM ENTROPY taggers are among the highest performing at the moment.
Freely available POS taggers

Quite a few taggers are freely available:

- Brill (TBL)
- QTAG (HMM; can be trained for other languages)
- LT POS (part of the Edinburgh LTG suite of tools)
- See Chris Manning's Statistical NLP resources web page (from the course web page)
Other kinds of tagging

- Sense tagging (SEMCOR, SENSEVAL)
- Syntactic tagging (`supertagging')
- Dialogue act tagging
- Semantic tagging (animacy, etc.)
Readings

Jurafsky and Martin, chapter 8