GENERAL AND COMPUTATIONAL LINGUISTICS - clic

GENERAL AND COMPUTATIONAL LINGUISTICS
PART-OF-SPEECH DISAMBIGUATION
POS tagging: the problem
• People/NNS continue/VBP to/TO inquire/VB
the/DT reason/NN for/IN the/DT race/NN
for/IN outer/JJ space/NN
• Problem: assign a tag to race
• Requires: tagged corpus
Ambiguity in POS tagging
Word:  The  man  still  saw  her
Tags:  AT   NN   NN     NN   PPO
            VB   VB     VBD  PP$
                 RB
How hard is POS tagging?
In the Brown corpus,
- 11.5% of word types ambiguous
- 40% of word TOKENS
Number of tags:        1      2     3    4   5   6  7
Number of word types:  35340  3760  264  61  12  2  1
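The difference between ambiguity over word TYPES and over word TOKENS can be sketched on a toy corpus. The (word, tag) pairs below are hypothetical and are not taken from the Brown corpus; only the counting method is the point.

```python
from collections import defaultdict

# Hypothetical toy corpus of (word, tag) pairs; NOT real Brown corpus data.
tagged = [
    ("the", "AT"), ("man", "NN"), ("saw", "VBD"), ("the", "AT"), ("saw", "NN"),
    ("we", "PPSS"), ("man", "VB"), ("the", "AT"), ("boat", "NN"),
]

# Collect the set of tags observed for each word type.
tags_per_type = defaultdict(set)
for word, tag in tagged:
    tags_per_type[word].add(tag)

# A word type is ambiguous if it was seen with more than one tag.
ambiguous_types = {w for w, tags in tags_per_type.items() if len(tags) > 1}

# Share of word types that are ambiguous.
type_ambiguity = len(ambiguous_types) / len(tags_per_type)
# Share of running tokens whose word type is ambiguous.
token_ambiguity = sum(1 for w, _ in tagged if w in ambiguous_types) / len(tagged)

print(f"ambiguous types:  {type_ambiguity:.1%}")
print(f"ambiguous tokens: {token_ambiguity:.1%}")
```

On a real corpus the token figure tends to be much higher than the type figure, because ambiguous words are often very frequent ones.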
Frequency + Context
• Both the Brill tagger and HMM-based taggers
achieve good results by combining
– FREQUENCY
• I poured FLOUR/NN into the bowl.
• Peter should FLOUR/VB the baking tray
– Information about CONTEXT
• I saw the new/JJ PLAY/NN in the theater.
• The boy will/MD PLAY/VB in the garden.
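A tagger that uses FREQUENCY alone can be sketched as follows. The training pairs are hypothetical, and the fallback tag "NN" for unknown words is an assumption made for this illustration. Note that such a tagger gives PLAY the same tag in both example sentences, which is why CONTEXT is also needed.

```python
from collections import Counter, defaultdict

# Hypothetical training pairs (Penn Treebank style tags), for illustration only.
train = [
    ("flour", "NN"), ("flour", "NN"), ("flour", "VB"),
    ("play", "NN"), ("play", "VB"), ("play", "VB"),
    ("the", "DT"), ("will", "MD"),
]

# counts[word][tag] = number of times `word` was seen with `tag`.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Return the tag seen most often with `word`, ignoring all context."""
    if word not in counts:
        return default  # assumed fallback for unknown words
    return counts[word].most_common(1)[0][0]

# Frequency alone always assigns the same tag, whatever the neighbouring words are.
for w in ("flour", "play"):
    print(w, most_frequent_tag(w))
```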
The importance of context
• Secretariat/NNP is/VBZ expected/VBN to/TO
race/VB tomorrow/NN
• People/NNS continue/VBP to/TO inquire/VB
the/DT reason/NN for/IN the/DT race/NN
for/IN outer/JJ space/NN
TAGGED CORPORA
Choosing a tagset
• The choice of tagset greatly affects the
difficulty of the problem
• Need to strike a balance between
– Getting better information about context (best:
introduce more distinctions)
– Making it possible for classifiers to do their job
(need to minimize distinctions)
Some of the best-known tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the BNC): 61
tags
• Lancaster C7: 145 tags
Important Penn Treebank tags
Verb inflection tags
The entire Penn Treebank tagset
UCREL C5
Tagsets for Italian
PAROLE
Si-TAL (Pisa, Venezia, IRST, ....)
TEXTPRO (covered later)
The SI-TAL tagset
POS tags in the Brown corpus
Television/NN has/HVZ yet/RB to/TO work/VB out/RP a/AT
living/VBG arrangement/NN with/IN jazz/NN ,/, which/WDT
comes/VBZ to/IN the/AT medium/NN more/QL as/CS an/AT
uneasy/JJ guest/NN than/CS as/CS a/AT relaxed/VBN
member/NN of/IN the/AT family/NN ./.
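Text in this word/TAG format can be read into (word, tag) pairs with a few lines of code. The function name parse_tagged is made up for this sketch; splitting on the LAST slash is what keeps tokens such as ,/, and ./. intact.

```python
def parse_tagged(text):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash, so that a token like './.' yields ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

sentence = "Television/NN has/HVZ yet/RB to/TO work/VB out/RP ./."
pairs = parse_tagged(sentence)
print(pairs)
```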
SGML-based POS in the BNC
<div1 complete=y org=seq>
<head>
<s n=00040> <w NN2>TROUSERS <w VVB>SUIT
</head>
<caption>
<s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing
<w AJ0>masculine <w PRP>about <w DT0>these <w
AJ0>new <w NN1>trouser <w NN2-VVZ>suits <w
PRP>in <w NN1>summer<w POS>'s <w AJ0>soft <w
NN2>pastels<c PUN>.
<s n=00042> <w NP0>Smart <w CJC>and <w
AJ0>acceptable <w PRP>for <w NN1>city <w NN1-VVB>wear
<w CJC>but <w AJ0>soft <w AV0>enough <w
PRP>for <w AJ0>relaxed <w NN2>days
</caption>
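A minimal sketch of how (word, tag) pairs could be pulled out of markup like the above with a regular expression. It assumes only the two element types visible in the sample, <w TAG> for words and <c TAG> for punctuation, and is not a general SGML parser; the names TOKEN and extract_tokens are made up for the illustration.

```python
import re

# Matches '<w TAG>text' or '<c TAG>text'; the text runs up to the next '<'.
# '\s+' also allows a line break between 'w' and the tag, as in the sample.
TOKEN = re.compile(r"<([wc])\s+([^>\s]+)>([^<]*)")

def extract_tokens(sgml):
    """Return (text, tag) pairs for every <w ...> and <c ...> element."""
    return [(text.strip(), tag) for _kind, tag, text in TOKEN.findall(sgml)]

line = "<s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing <w NN2-VVZ>suits<c PUN>."
tokens = extract_tokens(line)
print(tokens)
```

Ambiguity tags such as NN2-VVZ are kept as a single tag string, so the caller can decide how to treat them.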
Quick test
DoCoMo and Sony are to develop a chip that would let people
pay for goods through their mobiles.
POS TAGGED CORPORA IN NLTK
>>> import nltk
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
>>> # the Brown corpus must be installed first, e.g. with nltk.download('brown')
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
Exploring tagged corpora
• Bird et al., ch. 5, pp. 184-189
OTHER POS-TAGGED CORPORA
• NLTK:
• WAC Corpora:
– English: UKWAC
– Italian: ITWAC
POS TAGGING
Markov Model POS tagging
• Again, the problem is to find an ‘explanation’
with the highest probability:
$$\operatorname*{argmax}_{t_1 \ldots t_n,\; t_i \in T} P(t_1 \ldots t_n \mid w_1 \ldots w_n)$$
• As in the lecture on text classification, this can
be ‘turned around’ using Bayes’ Rule:
$$\operatorname*{argmax}_{t_1 \ldots t_n} \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)}$$
Combining frequency and contextual
information
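• As in the case of spelling, this equation can be
simplified, because the denominator P(w1 .. wn) is the
same for every candidate tag sequence:
$$\operatorname*{argmax}_{t_1 \ldots t_n} \underbrace{P(w_1 \ldots w_n \mid t_1 \ldots t_n)}_{\text{likelihood}} \; \underbrace{P(t_1 \ldots t_n)}_{\text{prior}}$$
• As we will see, once further simplifications are
applied, this equation will encode both
FREQUENCY and CONTEXT INFORMATION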
Three further assumptions
• MARKOV assumption: a tag only depends on a
FIXED NUMBER of previous tags (here, assume
bigrams)
– Simplify second factor
• INDEPENDENCE assumption: words are
independent of each other.
• A word’s identity only depends on its own tag
– Simplify first factor
The final equations
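CONTEXT (Markov assumption, bigrams):
$$P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
FREQUENCY (each word depends only on its own tag):
$$P(w_1 \ldots w_n \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$
Combined:
$$\operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$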
Estimating the probabilities
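Can be done using Maximum Likelihood Estimation as
usual, for BOTH probabilities, where C(.) is a count over the tagged corpus:
$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$
$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$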
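The counting behind Maximum Likelihood Estimation can be sketched as follows. The three tagged sentences are hypothetical, and the start-of-sentence pseudo-tag "<s>" is an assumption of this sketch. No smoothing is applied, so asking about a tag that never occurs in the data raises a division-by-zero error.

```python
from collections import Counter

# Hypothetical tagged sentences (Penn Treebank style), for illustration only.
corpus = [
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
    [("the", "DT"), ("reason", "NN")],
]

START = "<s>"  # assumed pseudo-tag marking the start of a sentence

tag_count = Counter()        # C(t)
bigram_count = Counter()     # C(t_prev, t)
emission_count = Counter()   # C(t, w)

for sentence in corpus:
    prev = START
    tag_count[START] += 1
    for word, tag in sentence:
        tag_count[tag] += 1
        bigram_count[(prev, tag)] += 1
        emission_count[(tag, word)] += 1
        prev = tag

def p_transition(tag, prev):
    """MLE of P(tag | prev) = C(prev, tag) / C(prev)."""
    return bigram_count[(prev, tag)] / tag_count[prev]

def p_emission(word, tag):
    """MLE of P(word | tag) = C(tag, word) / C(tag)."""
    return emission_count[(tag, word)] / tag_count[tag]

print(p_transition("NN", "DT"), p_emission("race", "NN"))
```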
An example of tagging with Markov
Models
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
tomorrow/NN
• People/NNS continue/VBP to/TO inquire/VB the/DT
reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• Problem: assign a tag to race given the subsequences
– to/TO race/???
– the/DT race/???
• Solution: for each context, we choose the tag with the greater
probability. For to/TO race/???, compare:
– P(VB|TO) P(race|VB)
– P(NN|TO) P(race|NN)
Tagging with MMs (2)
• Actual estimates from the Switchboard corpus:
• LEXICAL FREQUENCIES:
– P(race|NN) = .00041
– P(race|VB) = .00003
• CONTEXT:
– P(NN|TO) = .021
– P(VB|TO) = .34
• The probabilities:
– P(VB|TO) P(race|VB) = .34 × .00003 ≈ .00001
– P(NN|TO) P(race|NN) = .021 × .00041 ≈ .000009
• So race is tagged VB after to/TO
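The comparison can be checked with a few lines of code. The four probabilities are the estimates quoted above; the dictionary layout and the function name score are choices made for this sketch.

```python
# Estimates quoted above (Switchboard corpus).
p_word_given_tag = {("race", "NN"): 0.00041, ("race", "VB"): 0.00003}
p_tag_given_prev = {("NN", "TO"): 0.021, ("VB", "TO"): 0.34}

def score(tag, prev_tag, word):
    """P(tag | prev_tag) * P(word | tag)."""
    return p_tag_given_prev[(tag, prev_tag)] * p_word_given_tag[(word, tag)]

# Score both candidate tags for 'race' in the context to/TO.
scores = {tag: score(tag, "TO", "race") for tag in ("VB", "NN")}
best = max(scores, key=scores.get)

for tag, value in scores.items():
    print(f"{tag}: {value:.2e}")
print("chosen tag:", best)
```

Although race is more frequent as a noun, the context probability P(VB|TO) is large enough to make VB the preferred tag here.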
A graphical interpretation of the POS
tagging equations
Hidden Markov Models
An example
Computing the most likely sequence of
tags
• Naively enumerating every tag sequence t1 .. tn
to find the most likely one takes time
exponential in n
• The problem can however be solved in polynomial time
using DYNAMIC PROGRAMMING:
the VITERBI ALGORITHM
(Viterbi, 1967)
• (Algorithms of this kind are also called TRELLIS ALGORITHMS)
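A minimal sketch of the Viterbi algorithm for a bigram tagger, written for readability rather than speed. The probabilities involving race are the Switchboard estimates from the earlier example; all other numbers (the start probabilities, the DT transitions, the emissions for to and the) are hypothetical values chosen only to make the example run. Running time is proportional to n times the square of the number of tags.

```python
def viterbi(words, tags, p_start, p_trans, p_emit):
    """Most likely tag sequence under a bigram HMM.

    p_start[t]         : P(t | start of sentence)
    p_trans[(prev, t)] : P(t | prev)
    p_emit[(t, w)]     : P(w | t)
    Missing dictionary entries are treated as probability 0.
    """
    # best[i][t]: probability of the best tag sequence for words[:i+1] ending in t.
    best = [{t: p_start.get(t, 0.0) * p_emit.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]  # back[i][t]: tag at position i-1 on that best sequence
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximises the path probability.
            prev = max(tags, key=lambda s: best[i - 1][s] * p_trans.get((s, t), 0.0))
            best[i][t] = (best[i - 1][prev] * p_trans.get((prev, t), 0.0)
                          * p_emit.get((t, words[i]), 0.0))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

TAGS = ["TO", "DT", "VB", "NN"]
P_START = {"TO": 0.1, "DT": 0.3}                      # hypothetical
P_TRANS = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021,   # from the example above
           ("DT", "NN"): 0.5, ("DT", "VB"): 0.0001}   # hypothetical
P_EMIT = {("VB", "race"): 0.00003, ("NN", "race"): 0.00041,  # from the example above
          ("TO", "to"): 1.0, ("DT", "the"): 0.5}             # hypothetical

print(viterbi(["to", "race"], TAGS, P_START, P_TRANS, P_EMIT))
print(viterbi(["the", "race"], TAGS, P_START, P_TRANS, P_EMIT))
```

For long sentences, a practical implementation would usually add log probabilities instead of multiplying raw ones, to avoid numerical underflow.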
POS TAGGING IN NLTK
DEFAULT POS TAGGER: nltk.pos_tag
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
TEXTPRO
• The most widely used NLP tool for Italian
• http://textpro.fbk.eu/
• Demo
THE TEXTPRO TAGSET
READINGS
• Bird et al., chapters 5 and 6.1