Part-Of-Speech Tagging


Three Basic Problems

1. Compute the probability of a text (observation): P_m(W_{1,N})
   – language modeling: evaluate alternative texts and models
2. Compute the maximum-probability tag (state) sequence: arg max_{T_{1,N}} P_m(T_{1,N} | W_{1,N})
   – tagging / classification
3. Compute the maximum-likelihood model: arg max_m P_m(W_{1,N})
   – training / parameter estimation

Compute Text Probability

• Recall: P(W,T) = Π_i P(t_{i-1} → t_i) P(w_i | t_i)
• Text probability: need to sum P(W,T) over all possible tag sequences – an exponential number
• Dynamic programming approach – similar to the Viterbi algorithm
• Will also be used for estimating model parameters from an untagged corpus
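A minimal Python sketch of this product for one candidate tagging (the array names init/trans/emit and their index conventions are assumptions made for the example, not notation from the slides):

```python
import numpy as np

def joint_prob(obs, tags, init, trans, emit):
    """P(W,T) = prod_i P(t_{i-1} -> t_i) * P(w_i | t_i).

    obs   : list of word indices w_1..w_N
    tags  : list of tag indices t_1..t_N (one candidate tagging)
    init  : init[i]     ~ P(t_0 -> t_i)
    trans : trans[i, j] ~ P(t_i -> t_j)
    emit  : emit[i, w]  ~ P(w | t_i)
    """
    p = init[tags[0]] * emit[tags[0], obs[0]]
    for k in range(1, len(obs)):
        p *= trans[tags[k - 1], tags[k]] * emit[tags[k], obs[k]]
    return p
```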

Forward Algorithm

Define: A_i(k) = P(w_{1,k}, t_k = t_i);  N_t = total number of tags

1. For i = 1 To N_t: A_i(1) = m(t_0 → t_i) m(w_1 | t_i)
2. For k = 2 To N; For j = 1 To N_t:
   i. A_j(k) = [ Σ_i A_i(k-1) m(t_i → t_j) ] m(w_k | t_j)

Then: P_m(W_{1,N}) = Σ_i A_i(N)

Complexity = O(N_t^2 N) (like Viterbi, with Σ instead of max)

Forward Algorithm

[Trellis diagram: words w_1, w_2, w_3 across the top, tags t_1…t_5 as the states of each column; each state holds A_i(k), arcs between columns carry the transition probabilities m(t_i → t_j), the first column is reached via m(t_0 → t_i), and summing the final column gives P_m(W_{1,3}).]
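The forward recurrence translates directly into a few lines of code; here is a minimal sketch under the same assumed conventions as the previous snippet (0-based positions, trans[i, j] for m(t_i → t_j), emit[i, w] for m(w | t_i)):

```python
import numpy as np

def forward(obs, init, trans, emit):
    """Forward pass: A[i, k] = P(w_1..w_{k+1}, t_{k+1} = t_i), with 0-based k.

    obs   : list of word indices w_1..w_N
    init  : init[i]     ~ m(t_0 -> t_i)
    trans : trans[i, j] ~ m(t_i -> t_j)
    emit  : emit[i, w]  ~ m(w | t_i)
    """
    Nt, N = len(init), len(obs)
    A = np.zeros((Nt, N))
    A[:, 0] = init * emit[:, obs[0]]                   # step 1: A_i(1)
    for k in range(1, N):                              # step 2: k = 2..N
        A[:, k] = (A[:, k - 1] @ trans) * emit[:, obs[k]]
    return A

# P_m(W_{1,N}) is the sum of the last column:
# prob = forward(obs, init, trans, emit)[:, -1].sum()
```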

Backward Algorithm

Define B i (k) = P(w k+1,

N

| t k =t i ) 1. For

i =

1 To

N

t : B

i

(

N

) = 1 2. For k =

N

-1 To 1; For j = 1 To

N

t : i. B j (k) = [  i

m

(t j  t i )

m

(w k+1 | t i )B i (k+1) ] 3. Then: P

m

(W 1,

N

) =  i

m

(t 0  t i )

m

(w 1 | t i ) B i (1) Complexity = O(

N

t 2

N

)

Backward Algorithm

[Trellis diagram: words w_1, w_2, w_3 with tags t_1…t_5 as the states of each column; each state holds B_i(k), arcs carry the transition probabilities m(t_i → t_j), and combining the first column with m(t_0 → t_i) and m(w_1 | t_i) gives P_m(W_{1,3}).]
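A matching sketch of the backward pass under the same assumed conventions; together with the forward table it supplies the quantities needed for the EM counts below:

```python
import numpy as np

def backward(obs, trans, emit):
    """Backward pass: B[i, k] = P(w_{k+2}..w_N | t_{k+1} = t_i), with 0-based k.

    Same illustrative conventions as the forward sketch:
    trans[i, j] ~ m(t_i -> t_j), emit[i, w] ~ m(w | t_i).
    """
    Nt, N = trans.shape[0], len(obs)
    B = np.ones((Nt, N))                                 # step 1: B_i(N) = 1
    for k in range(N - 2, -1, -1):                       # step 2: k = N-1 .. 1
        B[:, k] = trans @ (emit[:, obs[k + 1]] * B[:, k + 1])
    return B

# P_m(W_{1,N}) = sum_i m(t_0 -> t_i) m(w_1 | t_i) B_i(1):
# prob = (init * emit[:, obs[0]] * backward(obs, trans, emit)[:, 0]).sum()
```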

Estimation from Untagged Corpus: EM – Expectation-Maximization

1. Start with some initial model
2. Compute the probability of (virtually) each state sequence given the current model
3. Use this probabilistic tagging to produce probabilistic counts for all parameters, and use these probabilistic counts to estimate a revised model, which increases the likelihood of the observed output W in each iteration
4. Repeat until convergence

Note: No labeled training required. Initialize by lexicon constraints regarding the possible POS for each word (cf. “noisy counting” for PPs).

Notation

• a_ij = estimate of P(t_i → t_j)
• b_jk = estimate of P(w_k | t_j)
• A_i(k) = P(w_{1,k}, t_k = t_i) (from the Forward algorithm)
• B_i(k) = P(w_{k+1,N} | t_k = t_i) (from the Backward algorithm)

Estimating transition probabilities

Define p_k(i,j) as the probability of traversing the arc t_i → t_j at time k, given the observations:

p_k(i,j) = P(t_k = t_i, t_{k+1} = t_j | W) = P(t_k = t_i, t_{k+1} = t_j, W) / P(W)

         = A_i(k) a_ij b_{j,k+1} B_j(k+1) / Σ_{r=1}^{N_t} A_r(k) B_r(k)

         = A_i(k) a_ij b_{j,k+1} B_j(k+1) / Σ_{r,s=1}^{N_t} A_r(k) a_rs b_{s,k+1} B_s(k+1)

Expected transitions

• Define g_i(k) = P(t_k = t_i | W); then: g_i(k) = Σ_{j=1}^{N_t} p_k(i,j)
• Now note that:
  – Expected number of transitions from tag i = Σ_{k=1}^{N} g_i(k)
  – Expected number of transitions from tag i to tag j = Σ_{k=1}^{N} p_k(i,j)
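Both the arc probabilities p_k(i,j) from the previous slide and the expected counts above fall out of the forward/backward tables; a sketch continuing the assumed conventions of the earlier snippets (arc_posteriors is an illustrative name, not from the slides):

```python
import numpy as np

def arc_posteriors(obs, trans, emit, A, B):
    """Arc posteriors: p[k, i, j] corresponds to the slides' p_{k+1}(i, j),
    i.e. P(t_{k+1} = t_i, t_{k+2} = t_j | W), with 0-based k.

    A, B are the tables from the forward/backward sketches;
    trans[i, j] ~ a_ij and emit[j, w] ~ b_j(w) as before.
    """
    N, Nt = len(obs), trans.shape[0]
    p = np.zeros((N - 1, Nt, Nt))
    for k in range(N - 1):
        num = A[:, k][:, None] * trans * (emit[:, obs[k + 1]] * B[:, k + 1])[None, :]
        p[k] = num / num.sum()          # the denominator equals P(W)
    return p

# Expected counts, given p:
#   gamma = p.sum(axis=2)    -> gamma[k, i] = g_i(k+1) = sum_j p_{k+1}(i, j)
#   gamma.sum(axis=0)        -> expected # of transitions from tag i
#   p.sum(axis=0)            -> expected # of transitions from tag i to tag j
```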

Re-estimation of Maximum Likelihood Parameters

• a'_ij = (expected # of transitions from tag i to tag j) / (expected # of transitions from tag i)
        = Σ_{k=1}^{N} p_k(i,j) / Σ_{k=1}^{N} g_i(k)

• b'_ik = (expected # of observations of w_k for tag i) / (expected # of transitions from tag i)
        = Σ_{r: w_r = w_k} Σ_{j=1}^{N_t} p_r(i,j) / Σ_{k=1}^{N} g_i(k)
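A sketch of the corresponding re-estimation step, reusing the p array produced by the arc_posteriors sketch above (the function name, the n_words argument, and the array conventions are illustrative assumptions):

```python
import numpy as np

def reestimate(obs, p, n_words):
    """One M-step: re-estimate a'_ij and b'_ik from the arc posteriors p.

    obs     : word indices w_1..w_N
    p       : p[k, i, j] from arc_posteriors, covering the N-1 arcs
    n_words : vocabulary size
    """
    gamma = p.sum(axis=2)                        # g_i(k) for positions 1..N-1
    denom = gamma.sum(axis=0)                    # expected transitions from tag i

    new_trans = p.sum(axis=0) / denom[:, None]   # a'_ij

    Nt = p.shape[1]
    new_emit = np.zeros((Nt, n_words))
    for k, w in enumerate(obs[:-1]):             # positions with a gamma estimate
        new_emit[:, w] += gamma[k]               # expected observations of w for tag i
    new_emit /= denom[:, None]                   # b'_ik
    return new_trans, new_emit
```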

EM Algorithm

1. Choose an initial model <a, b, g(1)>
2. Repeat until results don't improve (much):
   1. Compute p_k based on the current model, using the Forward & Backward algorithms to compute A and B (Expectation for counts)
   2. Compute the new model <a', b', g'(1)> (Maximization of parameters)

Note: Output likelihood is guaranteed to increase in each iteration, but might converge to a local maximum!
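Tying the earlier sketches together, a minimal EM loop could look as follows (forward, backward, arc_posteriors, and reestimate are the illustrative helpers defined above; the fixed iteration count stands in for a proper convergence test):

```python
# A minimal EM loop over one observation sequence `obs`.
for it in range(20):
    A = forward(obs, init, trans, emit)              # Expectation: forward ...
    B = backward(obs, trans, emit)                   # ... and backward tables
    p = arc_posteriors(obs, trans, emit, A, B)       # probabilistic counts p_k(i,j)
    trans, emit = reestimate(obs, p, emit.shape[1])  # Maximization: <a', b'>
    # init (g'(1)) could likewise be re-estimated from the first-position
    # posteriors; omitted here for brevity.
    print(it, A[:, -1].sum())                        # likelihood never decreases
```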

Initialize Model by Dictionary Constraints

• Training should be directed to correspond to the linguistic perception of POS (recall: local maximum)
• Achieved by a dictionary with the possible POS for each word
• Word-based initialization:
  – P(w|t) = 1 / (# of listed POS for w) for the listed POS; 0 for unlisted POS
• Class-based initialization (Kupiec, 1992):
  – Group all words with the same possible POS into a ‘metaword’
  – Estimate parameters and perform tagging for metawords
  – Frequent words are handled individually
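A sketch of the word-based initialization; the lexicon format and function name are assumptions for the example:

```python
import numpy as np

def init_emission_from_lexicon(lexicon, tag_index, n_words):
    """Word-based initialization: P(w|t) = 1 / #listed POS for w for the
    listed POS, and 0 for unlisted POS.

    lexicon   : dict word_index -> list of allowed tag names
    tag_index : dict tag name   -> tag index
    """
    Nt = len(tag_index)
    emit = np.zeros((Nt, n_words))
    for w, tags in lexicon.items():
        for t in tags:
            emit[tag_index[t], w] = 1.0 / len(tags)
    # (In practice one may renormalize each row afterwards so that a tag's
    # emission probabilities sum to 1; EM then refines them anyway.)
    return emit
```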

Some extensions for HMM POS tagging

• Higher-order models: trigrams, possibly interpolated with bigrams
• Incorporating text features:
  – Output prob = P(w_i, f_j | t_k), where f is a vector of features (capitalized, ends in -d, etc.)
  – Features useful to handle unknown words
• Combining labeled and unlabeled training (initialize with labeled data, then do EM)

Transformation-Based Learning (TBL) for Tagging

• Introduced by Brill (1995)
• Can exploit a wider range of lexical and syntactic regularities via transformation rules – a triggering environment and a rewrite rule

Tagger:
– Construct an initial tag sequence for the input – the most frequent tag for each word
– Iteratively refine the tag sequence by applying “transformation rules” in rank order

Learner:
– Construct an initial tag sequence for the training corpus
– Loop until done:
  • Try all possible rules and compare to the known tags; apply the best rule r* to the sequence and add it to the rule ranking

Some examples

1. Change NN to VB if the previous tag is TO
   – to/TO conflict/NN with → VB
2. Change VBP to VB if MD appears in the previous three tags
   – might/MD vanish/VBP → VB
3. Change NN to VB if MD appears in the previous two tags
   – might/MD reply/NN → VB
4. Change VB to NN if DT appears in the previous two tags
   – the/DT reply/VB → NN
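For concreteness, the first two rules above might be encoded and applied in rank order roughly as follows (the rule representation here is an illustrative assumption, not Brill's actual rule format):

```python
# Each rule: (source tag, target tag, trigger over the current tag sequence).
rules = [
    ("NN", "VB", lambda tags, i: i > 0 and tags[i - 1] == "TO"),    # rule 1 above
    ("VBP", "VB", lambda tags, i: "MD" in tags[max(0, i - 3):i]),   # rule 2 above
]

def apply_rules(tags, rules):
    tags = list(tags)
    for src, dst, trigger in rules:          # rules are applied in rank order
        for i, t in enumerate(tags):         # each rule sweeps left to right
            if t == src and trigger(tags, i):
                tags[i] = dst
    return tags

# e.g. apply_rules(["TO", "NN"], rules) tags "to conflict" as ["TO", "VB"]
```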

Transformation Templates

• Specify which transformations are possible. For example, change tag A to tag B when:
  1. The preceding (following) tag is Z
  2. The tag two before (after) is Z
  3. One of the two previous (following) tags is Z
  4. One of the three previous (following) tags is Z
  5. The preceding tag is Z and the following is W
  6. The preceding (following) tag is Z and the tag two before (after) is W

Lexicalization

New templates include dependency on the surrounding words (not just tags). Change tag A to tag B when:
1. The preceding (following) word is w
2. The word two before (after) is w
3. One of the two preceding (following) words is w
4. The current word is w
5. The current word is w and the preceding (following) word is v
6. The current word is w and the preceding (following) tag is X (notice: word–tag combination)
7. etc.

Initializing Unseen Words

How to choose the most likely tag for unseen words?

Transformation-based approach:
– Start with NP for capitalized words, NN for others
– Learn “morphological” transformations from templates of the form “change tag from X to Y if”:
  1. Deleting the prefix (suffix) x results in a known word
  2. The first (last) characters of the word are x
  3. Adding x as a prefix (suffix) results in a known word
  4. Word W ever appears immediately before (after) the word
  5. Character Z appears in the word

TBL Learning Scheme

[Diagram: unannotated input text passes through the initial-state annotator to produce annotated text; the learning algorithm compares this annotation with the ground truth for the input text and outputs an ordered list of rules.]

Greedy Learning Algorithm

• Initial tagging of the training corpus – the most frequent tag per word
• At each iteration:
  – Compute the “error reduction” for each transformation rule: #errors fixed - #errors introduced
  – Find the best rule; if its error reduction is greater than a threshold (to avoid overfitting):
    • Apply the best rule to the training corpus
    • Append the best rule to the ordered list of transformations
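A rough sketch of this greedy loop, reusing the apply_rules format from the earlier TBL snippet; most_frequent_tag, candidate_rules, and min_gain (the overfitting threshold) are assumed inputs/helpers:

```python
def learn_tbl(words, gold, candidate_rules, min_gain=2):
    """Greedy TBL learner sketch: pick the rule with the best error
    reduction, apply it to the corpus, append it, and repeat."""
    tags = [most_frequent_tag(w) for w in words]     # initial-state annotation (assumed helper)
    learned = []
    while True:
        errors = sum(t != g for t, g in zip(tags, gold))
        best, best_gain = None, 0
        for rule in candidate_rules:
            new_tags = apply_rules(tags, [rule])
            gain = errors - sum(t != g for t, g in zip(new_tags, gold))
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None or best_gain < min_gain:     # stop below the threshold
            break
        tags = apply_rules(tags, [best])             # apply best rule to the corpus
        learned.append(best)                         # append to the ordered rule list
    return learned
```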

Morphological Richness

• Parts of speech really include features:
  – NN2 → Noun(type=common, num=plural)
• This is more visible in other languages with richer morphology:
  – Hebrew nouns: number, gender, possession
  – German nouns: number, gender, case, …
  – And so on…