Intelligent Integration Technology for Internet Knowledge Information Based on Korean


Part-of-Speech Tagging
Artificial Intelligence Laboratory
정성원
The beginning
• The task of labeling (or tagging) each word in a sentence
with its appropriate part of speech.
– The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN
• Using Brown/Penn tag sets
• A problem of limited scope
– Instead of constructing a complete parse
– Fix the syntactic categories of the words in a sentence
• Tagging is a limited but useful application.
– Information extraction
– Question answering
– Shallow parsing
The Information Sources in Tagging
• Syntagmatic: look at the tags assigned to nearby
words; some combinations are highly likely while
others are highly unlikely or impossible
– ex) a new play → AT JJ NN is likely, AT JJ VBP is unlikely
• Lexical : look at the word itself. (90% accuracy just
by picking the most likely tag for each word)
– ex) a given word is typically far more likely to be one part of speech than another
Notation
• w_i : the word at position i in the corpus
• t_i : the tag of w_i
• w_{i,i+m} : the words occurring at positions i through i+m
• t_{i,i+m} : the tags t_i ... t_{i+m} for w_i ... w_{i+m}
• w^l : the lth word in the lexicon
• t^j : the jth tag in the tag set
• C(w^l) : the number of occurrences of w^l in the training set
• C(t^j) : the number of occurrences of t^j in the training set
• C(t^j, t^k) : the number of occurrences of t^j followed by t^k
• C(w^l, t^j) : the number of occurrences of w^l that are tagged as t^j
• T : number of tags in the tag set
• W : number of words in the lexicon
• n : sentence length
The Probabilistic Model (I)
• We model the sequence of tags in a text as a Markov chain.
– A word’s tag only depends on the previous tag (Limited
horizon)
P(X_{i+1} = t^j | X_1, ..., X_i) = P(X_{i+1} = t^j | X_i)
– Dependency does not change over time (Time invariance)
P(X_{i+1} = t^j | X_i) = P(X_2 = t^j | X_1)
• Compact notation for the Limited Horizon property:
P(t_{i+1} | t_{1,i}) = P(t_{i+1} | t_i)
The Probabilistic Model (II)
• Maximum likelihood estimates of the tag transition and lexical probabilities:
P(t^k | t^j) = C(t^j, t^k) / C(t^j)
P(w^l | t^j) = C(w^l, t^j) / C(t^j)
• The most probable tag sequence, by Bayes' rule (P(w_{1,n}) is the same for every tag sequence, so it can be dropped):
\arg\max_{t_{1,n}} P(t_{1,n} | w_{1,n}) = \arg\max_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n}) / P(w_{1,n})
                                        = \arg\max_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n})
The Probabilistic Model (III)
P(w_{1,n} | t_{1,n}) P(t_{1,n}) = \prod_{i=1}^{n} P(w_i | t_{1,n}) \cdot P(t_n | t_{1,n-1}) P(t_{n-1} | t_{1,n-2}) ... P(t_2 | t_1)
                                = \prod_{i=1}^{n} P(w_i | t_i) \cdot P(t_n | t_{n-1}) P(t_{n-1} | t_{n-2}) ... P(t_2 | t_1)
                                = \prod_{i=1}^{n} [ P(w_i | t_i) P(t_i | t_{i-1}) ]
(We define P(t_1 | t_0) = 1.0 to simplify our notation)
• The final equation
\hat{t}_{1,n} = \arg\max_{t_{1,n}} P(t_{1,n} | w_{1,n}) = \arg\max_{t_{1,n}} \prod_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
The Probabilistic Model (III)
• Algorithm for training a Visible Markov Model Tagger
Syntagmatic Probabilities:
for all tags tj do
for all tags tk do
P(tk | tj)=C(tj, tk)/C(tj)
end
end
Lexical Probabilities:
for all tags tj do
for all words wl do
P(wl | tj)=C(wl, tj)/C(tj)
end
end
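As a concrete illustration, here is a minimal Python sketch of this training step. It assumes the training data is a list of sentences of (word, tag) pairs and that sentences end in a token tagged PERIOD, as in the slides; the function name and the toy corpus are illustrative only.

from collections import defaultdict

def train_vmm_tagger(tagged_sentences):
    """Maximum likelihood estimates of P(t_k | t_j) and P(w_l | t_j) from counts."""
    tag_count = defaultdict(int)        # C(t_j)
    trans_count = defaultdict(int)      # C(t_j, t_k)
    emit_count = defaultdict(int)       # C(w_l, t_j)
    for sentence in tagged_sentences:
        prev_tag = "PERIOD"             # treat the sentence boundary as PERIOD
        for word, tag in sentence:
            tag_count[tag] += 1
            trans_count[(prev_tag, tag)] += 1
            emit_count[(word, tag)] += 1
            prev_tag = tag
    trans_prob = {(tj, tk): c / tag_count[tj] for (tj, tk), c in trans_count.items()}
    emit_prob = {(w, tj): c / tag_count[tj] for (w, tj), c in emit_count.items()}
    return trans_prob, emit_prob

# toy usage
corpus = [[("the", "AT"), ("president", "NN"), ("is", "BEZ"), ("on", "IN"),
           ("the", "AT"), ("move", "NN"), (".", "PERIOD")]]
trans, emit = train_vmm_tagger(corpus)
print(trans[("AT", "NN")], emit[("the", "AT")])   # both 1.0 for this tiny corpus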
The Probabilistic Model (IV)
First tag \ Second tag      AT     BEZ     IN      NN     VB   PERIOD
AT                           0       0      0   48636      0       19
BEZ                       1973       0    426     187      0       38
IN                       43322       0   1325   17314      0      185
NN                        1067    3720  42470   11773    614    21392
VB                        6072      42   4758    1476    129     1522
PERIOD                    8016      75   4656    1329    954        0
<Idealized counts of some tag transitions in the Brown Corpus>
The Probabilistic Model (V)
             AT     BEZ     IN     NN     VB   PERIOD
bear          0       0      0     10     43        0
is            0   10065      0      0      0        0
move          0       0      0     36    133        0
on            0       0   5484      0      0        0
president     0       0      0    382      0        0
progress      0       0      0    108      4        0
the       69016       0      0      0      0        0
.             0       0      0      0      0    48809
<Idealized counts for the tags that some words occur with in the Brown Corpus>
The Viterbi algorithm
comment : Given: a sentence of length n
comment : Initialization
δ1(PERIOD) = 1.0
δ1(t) = 0.0 for t ≠ PERIOD
comment : Induction
for i := 1 to n step 1 do
for all tags tj do
δi+1(tj) := max1<=k<=T[δi(tk)*P(wi+1|tj)*P(tj|tk)]
ψi+1(tj) := argmax1<=k<=T[δi(tk)*P(wi+1|tj)*P(tj|tk)]
end
end
comment : Termination and path-readout
Xn+1 = argmax1<=j<=T δn+1(j)
for j := n to 1 step -1 do
Xj = ψj+1(Xj+1)
end
P(X1 , … , Xn) = max1<=j<=T δn+1(tj)
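A Python sketch of the same induction and backtracking, assuming trans_prob and emit_prob come from the training sketch above and that PERIOD is in the tag set; unseen (word, tag) pairs get probability 0 here, so a real tagger would smooth these tables first.

def viterbi_tag(words, tags, trans_prob, emit_prob):
    """Most probable tag sequence for `words` under the bigram tagging model."""
    # delta[i][t]: probability of the best path that ends in tag t after i words
    # psi[i][t]:   predecessor tag on that best path
    delta = [{t: (1.0 if t == "PERIOD" else 0.0) for t in tags}]   # initialization
    psi = [{}]
    for word in words:                                             # induction
        delta_i, psi_i = {}, {}
        for tj in tags:
            best_prev, best_score = "PERIOD", 0.0   # fallback when all scores are 0
            for tk in tags:
                score = (delta[-1][tk]
                         * emit_prob.get((word, tj), 0.0)
                         * trans_prob.get((tk, tj), 0.0))
                if score > best_score:
                    best_prev, best_score = tk, score
            delta_i[tj], psi_i[tj] = best_score, best_prev
        delta.append(delta_i)
        psi.append(psi_i)
    # termination and path read-out
    last = max(tags, key=lambda t: delta[-1][t])
    path = [last]
    for i in range(len(words), 1, -1):
        path.append(psi[i][path[-1]])
    return list(reversed(path))

The nested loop over tk mirrors the max over 1 <= k <= T in the pseudocode above, and the final loop mirrors the path read-out.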
Variations (I)
• Unknown words
– Unknown words are a major problem for taggers
– The simplest model for unknown words
• Assume that they can be of any part of speech
– Use morphological information
• Past tense form : words ending in –ed
– Capitalized
P(w^l | t^j) = (1/Z) P(unknown word | t^j) P(capitalized | t^j) P(ending or hyphen | t^j)
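A hedged Python sketch of this unknown-word model. The three feature tables (p_unknown, p_cap, p_suffix) and the two-character suffix are illustrative stand-ins; in practice the feature probabilities are estimated from training data and Z renormalizes over all tags.

def unknown_word_emission(word, tag, p_unknown, p_cap, p_suffix):
    """Unnormalized P(w | t) for an unseen word, from independent features.

    p_unknown[t]       : probability that tag t emits an unknown word
    p_cap[t]           : probability of capitalization given tag t
    p_suffix[(sfx, t)] : probability of the word ending (e.g. '-ed') given tag t
    """
    suffix = word[-2:]                  # crude stand-in for the ending/hyphen feature
    cap = p_cap.get(tag, 0.0) if word[0].isupper() else 1.0 - p_cap.get(tag, 0.0)
    return p_unknown.get(tag, 0.0) * cap * p_suffix.get((suffix, tag), 0.0)

# dividing by Z = the sum of unknown_word_emission(word, t, ...) over all tags t
# gives the normalized probability in the formula above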
Variations (II)
• Trigram taggers
– The basic Markov Model tagger = bigram tagger
– two tag memory
– disambiguate more cases
• Interpolation and variable memory
– a trigram tagger may make worse predictions than a bigram tagger
– linear interpolation (see the code sketch below):
P(t_i | t_{1,i-1}) = \lambda_1 P_1(t_i) + \lambda_2 P_2(t_i | t_{i-1}) + \lambda_3 P_3(t_i | t_{i-1,i-2})
• Variable Memory Markov Model
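A minimal sketch of the linear interpolation above, assuming p1, p2, p3 are maximum likelihood unigram, bigram, and trigram tables keyed as shown; the lambda weights are placeholders and would normally be tuned on held-out data (e.g. by deleted interpolation).

def interpolated_transition(t_i, t_prev, t_prev2, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    """P(t_i | previous two tags) as a weighted mix of unigram, bigram, trigram estimates."""
    l1, l2, l3 = lambdas                    # the weights must sum to 1
    return (l1 * p1.get(t_i, 0.0)
            + l2 * p2.get((t_prev, t_i), 0.0)
            + l3 * p3.get((t_prev2, t_prev, t_i), 0.0))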
Variations (III)
• Smoothing
P(t^j | t^{j-1}) = (1 - \epsilon) C(t^{j-1}, t^j) / C(t^{j-1}) + \epsilon
P(t^j | w^l) = (C(t^j, w^l) + 1) / (C(w^l) + K_l)
K_l is the number of possible parts of speech of w^l
(a code sketch of both estimates follows at the end of this slide)
• Reversibility
– decoding a Markov model from left to right gives the same result as decoding from right to left
P(t_{1,n}) = P(t_1) P(t_{1,2} | t_1) P(t_{2,3} | t_2) ... P(t_{n-1,n} | t_{n-1})
           = P(t_1) P(t_{1,2}) P(t_{2,3}) ... P(t_{n-1,n}) / [ P(t_1) P(t_2) ... P(t_{n-1}) ]
           = P(t_n) P(t_{1,2} | t_2) P(t_{2,3} | t_3) ... P(t_{n-1,n} | t_n)
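A small Python sketch of the two smoothed estimates from the Smoothing bullet above. The count tables are assumed to be keyed as in the earlier training sketch, and epsilon is an illustrative value.

def smoothed_transition(tj, tj_prev, trans_count, tag_count, epsilon=0.01):
    """P(t_j | t_{j-1}) = (1 - eps) * C(t_{j-1}, t_j) / C(t_{j-1}) + eps"""
    if tag_count.get(tj_prev, 0) == 0:
        return epsilon
    return (1.0 - epsilon) * trans_count.get((tj_prev, tj), 0) / tag_count[tj_prev] + epsilon

def smoothed_lexical(tj, wl, emit_count, word_count, num_tags):
    """P(t_j | w_l) = (C(t_j, w_l) + 1) / (C(w_l) + K_l), with K_l = num_tags[wl]."""
    return (emit_count.get((wl, tj), 0) + 1) / (word_count.get(wl, 0) + num_tags[wl])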
Variations (IV)
• Maximum Likelihood: Sequence vs. tag by tag
– Viterbi Algorithm : maximize P(t_{1,n} | w_{1,n})
– Consider : maximize P(t_i | w_{1,n})
• for each i, which amounts to summing over different tag sequences
– ex) Time flies like an arrow.
• a. NN VBZ RB AT NN.   P(.) = 0.01
• b. NN NNS VB AT NN.   P(.) = 0.01
• c. NN NNS RB AT NN.   P(.) = 0.001
• d. NN VBZ VB AT NN.   P(.) = 0
– one error does not affect the tagging of other words
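The difference between the two criteria can be made concrete with a brute-force sketch (feasible only for tiny sentences); the probability tables are assumed to come from the earlier training sketch.

from itertools import product

def compare_decodings(words, tags, trans_prob, emit_prob):
    """Contrast sequence decoding (argmax of the joint P(t_{1,n}, w_{1,n})) with
    tag-by-tag decoding (argmax of the marginal P(t_i | w_{1,n}) at each i)."""
    def joint(seq):
        p, prev = 1.0, "PERIOD"
        for w, t in zip(words, seq):
            p *= emit_prob.get((w, t), 0.0) * trans_prob.get((prev, t), 0.0)
            prev = t
        return p

    all_seqs = list(product(tags, repeat=len(words)))
    best_sequence = max(all_seqs, key=joint)              # Viterbi-style criterion
    tag_by_tag = []
    for i in range(len(words)):
        marginal = {t: sum(joint(s) for s in all_seqs if s[i] == t) for t in tags}
        tag_by_tag.append(max(marginal, key=marginal.get))
    return best_sequence, tuple(tag_by_tag)

Because each position is chosen independently, tag-by-tag decoding can pick a sequence like d. above that has zero probability as a whole, but it has the advantage noted above: one error does not affect the tagging of other words.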
Applying HMMs to POS tagging(I)
• If we have no training data, we can use an HMM to learn the regularities of tag sequences.
• An HMM consists of the following elements:
– a set of states ( = tags )
– an output alphabet ( words or classes of words )
– initial state probabilities
– state transition probabilities
– symbol emission probabilities
Applying HMMs to POS tagging(II)
• Jelinek’s method
– bj.l : probability that word (or word class) l is emitted by
tag j
b_{j,l} = b^*_{j,l} C(w^l) / \sum_m b^*_{j,m} C(w^m)

b^*_{j,l} = 0             if t^j is not a part of speech allowed for w^l
          = 1 / T(w^l)    otherwise

where T(w^l) is the number of tags allowed for w^l
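A Python sketch of this initialization, assuming only a dictionary (lexicon[w] = set of allowed tags) and raw word counts from untagged text; the function and variable names are illustrative.

def jelinek_emission_init(lexicon, word_count, tags):
    """b_{j,l} = b*_{j,l} C(w_l) / sum_m b*_{j,m} C(w_m), with b* spread evenly
    over the tags the dictionary allows for each word."""
    b_star = {(t, w): (1.0 / len(lexicon[w]) if t in lexicon[w] else 0.0)
              for w in lexicon for t in tags}
    b = {}
    for t in tags:
        z = sum(b_star[(t, w)] * word_count.get(w, 0) for w in lexicon)
        for w in lexicon:
            b[(t, w)] = b_star[(t, w)] * word_count.get(w, 0) / z if z > 0 else 0.0
    return b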
Applying HMMs to POS tagging(III)
• Kupiec’s method
u_L = { w^l : t^j is allowed for w^l iff j ∈ L },   L ⊆ {1, ..., T}

b_{j,L} = b^*_{j,L} C(u_L) / \sum_{L'} b^*_{j,L'} C(u_{L'})

b^*_{j,L} = 0          if j ∉ L
          = 1 / |L|    otherwise
|L| is the number of indices in L
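A companion sketch of Kupiec's grouping, under the same assumptions as the previous one: words with the same set of allowed tags are collapsed into one metaword u_L, and emission probabilities are defined over these classes.

from collections import defaultdict

def kupiec_emission_init(lexicon, word_count, tags):
    """b_{j,L} over word classes u_L (one class per distinct allowed-tag set L)."""
    class_count = defaultdict(int)                 # C(u_L)
    for w, allowed in lexicon.items():
        class_count[frozenset(allowed)] += word_count.get(w, 0)
    b = {}
    for t in tags:
        b_star = {L: (1.0 / len(L) if t in L else 0.0) for L in class_count}
        z = sum(b_star[L] * class_count[L] for L in class_count)
        for L in class_count:
            b[(t, L)] = b_star[L] * class_count[L] / z if z > 0 else 0.0
    return b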
Transformation-Based Learning of Tags
• Markov assumptions are too crude
→ transformation-based tagging
• Exploits a wider range of lexical and syntactic regularities
• An order of magnitude fewer decisions
• Two key components
– a specification of which ‘error-correcting’
transformations are admissible
– The learning algorithm
Transformation(I)
• A triggering environment
<Table: triggering-environment schemas 1-9; each schema marks which of the neighbouring tags t_{i-3}, t_{i-2}, t_{i-1}, t_{i+1}, t_{i+2}, t_{i+3} are inspected when deciding whether to rewrite the tag t at position i>
• A rewrite rule
– Form t1→t2 : replace t1 by t2
Transformation(II)
Source tag   Target tag   Triggering environment
NN           VB           previous tag is TO
VBP          VB           one of the previous three tags is MD
JJR          RBR          next tag is JJ
VBP          VB           one of the previous two words is n't
• environments can be conditioned
– combination of words and tags
• Morphology-triggered transformation
– ex) Replace NN by NNS if the unknown word’s suffix is -s
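A small Python sketch of applying one such rewrite rule; the trigger is passed in as a function over the current position, and the rule shown is the first row of the table above (NN to VB when the previous tag is TO). Names are illustrative.

def apply_transformation(tags, words, source, target, trigger):
    """Rewrite source -> target at every position whose environment fires the trigger."""
    new_tags = list(tags)
    for i, tag in enumerate(tags):                  # triggers inspect the input tagging
        if tag == source and trigger(i, tags, words):
            new_tags[i] = target
    return new_tags

# 'previous tag is TO': corrects e.g. an infinitive mis-tagged as TO NN
print(apply_transformation(["TO", "NN"], ["to", "race"], "NN", "VB",
                           lambda i, t, w: i > 0 and t[i - 1] == "TO"))
# -> ['TO', 'VB']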
The learning algorithm
C0 := corpus with each word tagged with its most frequent tag
for k:=0 step 1 do
ν:=the transformation ui that minimizes E(ui(Ck))
if (E(Ck) - E(ν(Ck))) < ε then break fi
Ck+1:= ν(Ck)
τk+1:= ν
end
Output sequence: τ1, …, τk
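A Python sketch of this greedy loop, reusing apply_transformation from the previous sketch. It assumes a flat list of words with a gold tagging for computing the error count E(.), and a finite list of candidate (source, target, trigger) rules; all names are illustrative.

def learn_transformations(initial_tags, gold_tags, words, candidate_rules, epsilon=1):
    """Greedily pick the rule that most reduces tagging errors, until the gain < epsilon."""
    def errors(tags):
        return sum(1 for t, g in zip(tags, gold_tags) if t != g)

    current, learned = list(initial_tags), []
    while True:
        best_rule, best_tags = None, None
        for rule in candidate_rules:                       # evaluate every candidate rule
            candidate = apply_transformation(current, words, *rule)
            if best_tags is None or errors(candidate) < errors(best_tags):
                best_rule, best_tags = rule, candidate
        if best_rule is None or errors(current) - errors(best_tags) < epsilon:
            break                                          # improvement below threshold
        current, learned = best_tags, learned + [best_rule]
    return learned                                         # the rule sequence tau_1, ..., tau_k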
Relation to other models
• Decision trees
– similarity with Transformation-based learning
• a series of relabelings
– difference with Transformation-based learning
• split at each node in a decision tree
• different sequence of transformation for each node
• Probabilistic models in general
Automata
• Transformation-based tagging has a rule component, but during learning it also has a quantitative component.
• Once learning is complete, transformation-based tagging is purely symbolic.
• A transformation-based tagger can therefore be converted into another symbolic object.
• Roche and Schabes (1995) : finite-state transducer
• Advantage : speed
Other Methods, Other Languages
• Other approaches to tagging
– In chapter 16
• Languages other than English
– In many other languages, word order is much
freer
– The rich inflections of a word contribute more
information about part of speech
Tagging accuracy
• 95%~97% when calculated over all words
• Factors that influence accuracy
– The amount of training data available
– The tag set
– The difference between training set and test set
– Unknown words
• a ‘dumb’ tagger
– Always chooses a word’s most frequent tag
– Accuracy of about 90%
• EngCG
Applications of tagging
• Benefit from syntactically disambiguated text
• Partial Parsing
– Finding noun phrases in a sentence
• Information Extraction
– Finding values for the predefined slots of a template
– Finding good indexing terms in information retrieval
• Question Answering
– Returning an appropriate noun such as a location, a person,
or a date