
Part-of-speech tagging and chunking with log-linear models

University of Manchester National Centre for Text Mining (NaCTeM) Yoshimasa Tsuruoka

Outline

• POS tagging and chunking for English
– Conditional Markov Models (CMMs)
– Dependency Networks
– Bidirectional CMMs
• Maximum entropy learning
• Conditional Random Fields (CRFs)
• Domain adaptation of a tagger

Part-of-speech tagging

The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

• The tagger assigns a part-of-speech tag to each word in the sentence.

Algorithms for part-of-speech tagging

• Tagging speed and accuracy on WSJ

Algorithm                  Tagging speed  Accuracy
Dependency Net (2003)      Slow           97.24
SVM (2004)                 Fast           97.16
Perceptron (2002)          ?              97.11
Bidirectional CMM (2005)   Fast           97.10
HMM (2000)                 Very fast      96.7*
CMM (1998)                 Fast           96.6*

* evaluated on a different portion of WSJ

Chunking (shallow parsing)

[He]_NP [reckons]_VP [the current account deficit]_NP [will narrow]_VP [to]_PP [only # 1.8 billion]_NP [in]_PP [September]_NP .

• A chunker (shallow parser) segments a sentence into non-recursive phrases.

Chunking (shallow parsing)

He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP #/I-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP ./O

• Chunking tasks can be converted into a standard tagging task
• Different approaches:
– Sliding window
– Semi-Markov CRF
– …
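The conversion from chunk spans to per-token B/I tags can be sketched in a few lines; the function and span format below are illustrative, not a specific toolkit's API:

```python
# Hypothetical sketch: turning chunk spans into B/I tags so that chunking
# becomes a standard per-token tagging task.

def chunks_to_bio(tokens, chunks):
    """chunks: list of (start, end, label) spans over token indices,
    end exclusive. Tokens outside any chunk get the tag 'O'."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label           # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label           # tokens inside the chunk
    return tags

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
print(chunks_to_bio(tokens, chunks))
# ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP']
```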

Algorithms for chunking

• Chunking speed and accuracy on Penn Treebank

Algorithm                  Speed   Accuracy
SVM + voting (2001)        Slow?   93.91
Perceptron (2003)          ?       93.74
Bidirectional CMM (2005)   Fast    93.70
SVM (2000)                 Fast    93.48

Conditional Markov Models (CMMs)

t1 → t2 → t3, conditioned on the observation o (diagram omitted)

• Left-to-right decomposition (with the first-order Markov assumption)

P(t_1 … t_n | o) = ∏_{i=1}^{n} P(t_i | t_1 … t_{i-1}, o) ≈ ∏_{i=1}^{n} P(t_i | t_{i-1}, o)
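The CMM factorization can be sketched numerically: the sequence probability is just the product of local probabilities P(t_i | t_{i-1}, o). The toy local distributions below are invented for illustration:

```python
# Minimal sketch of the first-order CMM factorization. The local model is a
# hand-made table (prev_tag, position) -> P(tag | prev_tag, o), purely
# illustrative; a real tagger would use a trained local classifier.

local = {
    (None, 0): {"PRP": 0.9, "NN": 0.1},
    ("PRP", 1): {"VBZ": 0.8, "NNS": 0.2},
    ("NN", 1): {"VBZ": 0.5, "NNS": 0.5},
    ("VBZ", 2): {"RB": 0.7, "JJ": 0.3},
    ("NNS", 2): {"RB": 0.6, "JJ": 0.4},
}

def sequence_prob(tags):
    """P(t_1 ... t_n | o) as the product of local probabilities."""
    p, prev = 1.0, None
    for i, t in enumerate(tags):
        p *= local[(prev, i)][t]
        prev = t
    return p

print(sequence_prob(["PRP", "VBZ", "RB"]))  # 0.9 * 0.8 * 0.7
```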

POS tagging with CMMs

[Ratnaparkhi 1996; etc.]

• Left-to-right decomposition

P(t_1 t_2 t_3 | o) = P(t_1 | o) · P(t_2 | t_1, o) · P(t_3 | t_2, o)

– The local classifier uses the information on the preceding tag.

He runs fast

Examples of the features for local classification

Word unigram       w_i, w_{i-1}, w_{i+1}
Word bigram        w_{i-1} w_i, w_i w_{i+1}
Previous tag       t_{i-1}
Tag/word           t_{i-1} w_i
Prefix/suffix      up to length 10
Lexical features   hyphen, number, etc.

He/PRP runs/? fast
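The feature table above can be sketched as an extraction function; the feature-name strings and padding token are illustrative conventions, not the original tagger's exact templates:

```python
# Sketch of local feature extraction for the word at position i, given the
# previous tag. Feature names are invented for illustration.

def local_features(words, i, prev_tag):
    w = lambda j: words[j] if 0 <= j < len(words) else "<PAD>"
    feats = [
        "w0=" + w(i), "w-1=" + w(i - 1), "w+1=" + w(i + 1),    # word unigrams
        "w-1w0=" + w(i - 1) + "_" + w(i),                      # word bigrams
        "w0w+1=" + w(i) + "_" + w(i + 1),
        "t-1=" + prev_tag,                                     # previous tag
        "t-1w0=" + prev_tag + "_" + w(i),                      # tag/word
    ]
    for k in range(1, min(10, len(w(i))) + 1):                 # prefixes/suffixes
        feats.append("pre=" + w(i)[:k])
        feats.append("suf=" + w(i)[-k:])
    if "-" in w(i):
        feats.append("has_hyphen")                             # lexical features
    if any(c.isdigit() for c in w(i)):
        feats.append("has_digit")
    return feats

print(local_features(["He", "runs", "fast"], 1, "PRP")[:4])
# ['w0=runs', 'w-1=He', 'w+1=fast', 'w-1w0=He_runs']
```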

POS tagging with Dependency Network

[Toutanova et al. 2003]

t1 – t2 – t3, each tag conditioned on both neighbours (diagram omitted)

• Use the information on the following tag as well

Score(t_1, …, t_n | o) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i+1}, o)

– This is no longer a probability.
– You can use the following tag as a feature in the local classification model.

POS tagging with a Cyclic Dependency Network

[Toutanova et al. 2003] t 1 t 2 t 3 • Training cost is small – almost equal to CMMs.

• Decoding can be performed with dynamic programming, but it is still expensive.

• Collusion – the model can lock onto conditionally consistent but jointly unlikely sequences.

Bidirectional CMMs

[Tsuruoka and Tsujii, 2005]

• Possible decomposition structures: (a)–(d), mixing left-to-right and right-to-left conditioning over t1, t2, t3 (diagram omitted)
• Bidirectional CMMs
– We can find the "best" structure and tag sequences in polynomial time

Maximum entropy learning

• Log-linear modeling (λ_i: feature weight, f_i: feature function)

p(y | x) = (1/Z(x)) exp( Σ_i λ_i f_i(x, y) )

Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )

Maximum entropy learning

• Maximum likelihood estimation – Find the parameters that maximize the (log-) likelihood of the training data

LL(Λ) = Σ_{(x,y)} log p(y | x)

• Smoothing
– Gaussian prior [Berger et al., 1996]
– Inequality constraints [Kazama and Tsujii, 2005]

Parameter estimation

• Algorithms for maximum entropy
– GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997]
• General-purpose algorithms for numerical optimization
– BFGS [Nocedal and Wright, 1999], LMVM [Benson and More, 2001]
• You need to provide the objective function and gradient:
– Likelihood of training samples
– Model expectation of each feature

LL(Λ) = Σ_{(x,y)} log p(y | x)

∂LL(Λ)/∂λ_i = Ẽ[f_i] − E_p[f_i]   (empirical count minus model expectation)
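A minimal sketch of what such an optimizer needs, assuming a toy setup with two labels and binary indicator features (all names and data are invented for illustration): p(y|x), the log-likelihood, and its gradient as empirical counts minus model expectations.

```python
import math

# Sketch of the maximum entropy objective and gradient. Each training
# instance is (feats_by_label, gold_label), where feats_by_label maps a
# candidate label to the list of features that fire for it.

LABELS = ["Noun", "Verb"]

def prob(weights, feats_by_label):
    """p(y|x) = exp(sum_i lambda_i f_i(x,y)) / Z(x)."""
    scores = {y: math.exp(sum(weights.get(f, 0.0) for f in feats_by_label[y]))
              for y in LABELS}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def ll_and_grad(weights, data):
    """Log-likelihood and its gradient over the training data."""
    ll, grad = 0.0, {f: 0.0 for f in weights}
    for feats_by_label, gold in data:
        p = prob(weights, feats_by_label)
        ll += math.log(p[gold])
        for f in feats_by_label[gold]:
            grad[f] += 1.0                 # empirical count  E~[f_i]
        for y in LABELS:
            for f in feats_by_label[y]:
                grad[f] -= p[y]            # model expectation E_p[f_i]
    return ll, grad

feats = {"Noun": ["word=opened&t=Noun"], "Verb": ["word=opened&t=Verb"]}
w = {"word=opened&t=Noun": 0.0, "word=opened&t=Verb": 0.0}
ll, grad = ll_and_grad(w, [(feats, "Verb")])
print(round(ll, 4), round(grad["word=opened&t=Verb"], 4))  # -0.6931 0.5
```

With all-zero weights the model is uniform, so the gold-label feature gets gradient 1 − 0.5 = 0.5, pushing its weight up; a BFGS-style optimizer would iterate from there.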

Computing likelihood and model expectation

• Example
– Two possible tags: "Noun" and "Verb"
– Two types of features: "word" and "suffix"

He opened it
Noun Verb Noun

Features firing on "opened":
– tag = Verb & word = "opened"
– tag = Verb & suffix = "ed"
– tag = Noun & word = "opened"
– tag = Noun & suffix = "ed"

p(Verb | x) = exp(λ_{Verb,word} + λ_{Verb,suffix}) / Z(x), where Z(x) sums the corresponding exponentials for tag = Noun and tag = Verb.
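The "opened" example can be worked through with concrete numbers; the four feature weights below are invented purely for illustration:

```python
import math

# The toy example above, with assumed (invented) weights for the four
# features that fire on "opened".

w_word_verb, w_suf_verb = 1.0, 0.5    # fire when tag = Verb
w_word_noun, w_suf_noun = 0.2, 0.1    # fire when tag = Noun

score_verb = math.exp(w_word_verb + w_suf_verb)   # both Verb features fire
score_noun = math.exp(w_word_noun + w_suf_noun)   # both Noun features fire
z = score_verb + score_noun                        # Z(x)

p_verb = score_verb / z
print(round(p_verb, 3))
```

Since the Verb features carry more weight, the model prefers Verb for "opened", matching the gold tag in the example sentence.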

Conditional Random Fields (CRFs)

• A single log-linear model on the whole sentence

P(t_1 … t_n | o) = (1/Z(o)) exp( Σ_{i=1}^{F} λ_i f_i(t_1 … t_n, o) )

• One can use exactly the same techniques as maximum entropy learning to estimate the parameters.

• However, the number of classes is HUGE, and it is impossible in practice to do it in a naive way.

Conditional Random Fields (CRFs)

• Solution
– Let's restrict the types of features
– Then, you can use a dynamic programming algorithm that drastically reduces the amount of computation
• Features you can use (in first-order CRFs)
– Features defined on the tag
– Features defined on the adjacent pair of tags

Features

• Feature weights are associated with states and edges

He has opened it
(trellis of Noun/Verb states at each position; diagram omitted)

– State feature example: Tag = Noun
– Edge feature example: Tag_left = Noun & Tag_right = Verb

A naive way of calculating Z(x)

(He has opened it: each of the 2^4 = 16 tag sequences is scored separately)

Noun Noun Noun Noun = 7.2
Noun Noun Noun Verb = 1.3
Noun Noun Verb Noun = 4.5
Noun Noun Verb Verb = 0.9
Noun Verb Noun Noun = 2.3
Noun Verb Noun Verb = 11.2
Noun Verb Verb Noun = 3.4
Noun Verb Verb Verb = 2.5
Verb Noun Noun Noun = 4.1
Verb Noun Noun Verb = 0.8
Verb Noun Verb Noun = 9.7
Verb Noun Verb Verb = 5.5
Verb Verb Noun Noun = 5.7
Verb Verb Noun Verb = 4.3
Verb Verb Verb Noun = 2.2
Verb Verb Verb Verb = 1.9

Sum = 67.5

Dynamic programming

• Results of intermediate computation can be reused.

He has opened it
(trellis of Noun/Verb states; forward sums are shared across sequences; diagram omitted)
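The reuse of intermediate sums can be checked on a toy model: computing Z(x) by brute-force enumeration and by a forward-style dynamic program gives the same value. All state and edge potentials below are invented for illustration:

```python
import itertools

# Z(x) two ways for a 4-token sentence with tags {Noun, Verb}:
# brute-force enumeration of all 2^4 sequences vs. the forward recursion.
# Potentials (exponentiated feature scores) are invented for illustration.

TAGS = ["Noun", "Verb"]
N = 4
state = {(i, t): 1.0 + 0.1 * i + (0.5 if t == "Verb" else 0.0)
         for i in range(N) for t in TAGS}          # state potentials
edge = {(a, b): 1.2 if a != b else 0.8 for a in TAGS for b in TAGS}

def z_brute():
    """Sum the score of every tag sequence explicitly (exponential cost)."""
    total = 0.0
    for seq in itertools.product(TAGS, repeat=N):  # 2^N sequences
        s = state[(0, seq[0])]
        for i in range(1, N):
            s *= edge[(seq[i - 1], seq[i])] * state[(i, seq[i])]
        total += s
    return total

def z_forward():
    """Forward recursion: alpha_i(t) = state(i,t) * sum_a alpha_{i-1}(a) * edge(a,t)."""
    alpha = {t: state[(0, t)] for t in TAGS}
    for i in range(1, N):
        alpha = {t: state[(i, t)] * sum(alpha[a] * edge[(a, t)] for a in TAGS)
                 for t in TAGS}
    return sum(alpha.values())

print(abs(z_brute() - z_forward()) < 1e-9)  # True: DP matches enumeration
```

The brute-force version touches 2^N sequences, while the forward recursion does O(N · |TAGS|^2) work; this is what makes first-order CRF training tractable.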

Maximum entropy learning and Conditional Random Fields

• Maximum entropy learning
– Log-linear modeling + MLE
– Parameter estimation: likelihood of each sample, model expectation of each feature
• Conditional Random Fields
– Log-linear modeling on the whole sentence
– Features are defined on states and edges
– Dynamic programming

Named Entity Recognition

We have shown that [interleukin-1 (IL-1)]_protein and [IL-2]_protein control [[IL-2 receptor alpha (IL-2R alpha)]_protein gene]_DNA transcription in [CD4-CD8-murine T lymphocyte precursors]_cell_line.

Algorithms for Biomedical Named Entity Recognition

• Shared task data from the COLING 2004 BioNLP workshop

Method                                     Recall  Precision  F-score
SVM+HMM (2004)                             76.0    69.4       72.6
Semi-Markov CRF [Okanohara et al., 2006]   72.7    70.4       71.5
Sliding window                             75.8    67.5       70.8
MEMM (2004)                                71.6    68.6       70.1
CRF (2004)                                 70.3    69.3       69.8

Domain adaptation

• Large training corpora are available for general domains (e.g. Penn Treebank WSJ)
• NLP tools trained on general-domain data are less accurate on biomedical text
• Developing domain-specific annotated data requires considerable human effort

Tagging errors made by a tagger trained on WSJ

… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
… two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN …

• Accuracy of the tagger on the GENIA POS corpus: 84.4%

Re-training of maximum entropy models

• Taggers trained as maximum entropy models

p(y | x) = (1/Z(x)) exp( Σ_{i=1}^{F} λ_i f_i(x, y) )
(λ_i: model parameter; f_i: feature function, given by the developer)

• Adapting maximum entropy models to target domains by re-training with domain-specific data

Methods for domain adaptation

• Combined training data: a model is trained from scratch with the original and domain-specific data
• Reference distribution: an original model is used as a reference probabilistic distribution of a domain-specific model

p_new(y | x) = (1/Z(x)) p_orig(y | x) exp( Σ_{i=1}^{F} λ_i f_i(x, y) )

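The reference-distribution method can be sketched in a few lines: the adapted model multiplies the original model's distribution by a log-linear factor trained on the target domain. All probabilities and scores below are invented for illustration:

```python
import math

# Sketch of adaptation with a reference distribution:
# p_new(y|x) proportional to p_orig(y|x) * exp(sum_i lambda_i f_i(x, y)).

def p_new(p_orig, domain_score):
    """Renormalize the original model reweighted by domain-specific scores."""
    unnorm = {y: p_orig[y] * math.exp(domain_score[y]) for y in p_orig}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}

p_orig = {"NN": 0.7, "JJ": 0.3}          # original (general-domain) model
domain_score = {"NN": 0.0, "JJ": 1.0}    # domain-specific log-linear scores

adapted = p_new(p_orig, domain_score)
print(round(adapted["JJ"], 3))  # JJ gains probability mass vs. the original
```

Only the domain-specific weights need to be estimated, which is why re-training against a reference distribution is so much cheaper than training a combined model from scratch.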
Adaptation of the part-of-speech tagger

• Relationships among training and test data are evaluated for the following corpora
– WSJ: Penn Treebank WSJ
– GENIA: GENIA POS corpus [Kim et al., 2003]; 2,000 MEDLINE abstracts selected by the MeSH terms Human, Blood cells, and Transcription factors
– PennBioIE: Penn BioIE corpus [Kulick et al., 2004]; 1,157 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes and 1,100 MEDLINE abstracts about molecular genetics of cancer
– Fly: 200 MEDLINE abstracts on Drosophila melanogaster

Training and test sets

• Training sets

Corpus      # tokens   # sentences
WSJ         912,344    38,219
GENIA       450,492    18,508
PennBioIE   641,838    29,422
Fly         –          1,024

• Test sets

Corpus      # tokens   # sentences
WSJ         129,654    5,462
GENIA       50,562     2,036
PennBioIE   70,713     3,270
Fly         7,615      326

Experimental results

Accuracy (%)

Test set    WSJ+GENIA+PennBioIE   Fly only   Combined   Ref. dist.
WSJ         96.68                 93.91      96.69      95.38
GENIA       98.10                 –          98.12      98.17
PennBioIE   97.65                 –          97.65      96.93
Fly         96.35                 –          97.94      98.08

Training time (sec.): Combined 30,632 vs. Ref. dist. 21

Corpus size vs. accuracy (combined training data)

[Plot: tagging accuracy (95.0–99.0%) vs. number of Fly training sentences (8–1,024) for the WSJ, GENIA, and PennBioIE test sets; figure omitted]

Corpus size vs. accuracy (reference distribution)

[Plot: tagging accuracy (94.0–99.0%) vs. number of Fly training sentences (8–1,024) for the WSJ, GENIA, and PennBioIE test sets; figure omitted]

Summary

• POS tagging
– MEMM-like approaches achieve good performance with reasonable computational cost. CRFs seem to be too computationally expensive at present.

• Chunking
– CRFs yield good performance for NP chunking. Semi-Markov CRFs are promising, but we need to somehow reduce their computational cost.

• Domain Adaptation
– One can easily incorporate information from the original domain by using the original model as a reference distribution.

References

• A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics.
• Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP.
• Thorsten Brants. (2000). TnT: A Statistical Part-Of-Speech Tagger. Proceedings of ANLP.
• Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL.
• John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML.
• Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP.
• Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.
• K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.

References

• Xavier Carreras and Lluís Márquez. (2003). Phrase Recognition by Filtering and Ranking with Perceptrons. Proceedings of RANLP.
• Jesús Giménez and Lluís Márquez. (2004). SVMTool: A General POS Tagger Generator Based on Support Vector Machines. Proceedings of LREC.
• Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov Conditional Random Fields for Information Extraction. Proceedings of NIPS.
• Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP.
• Yuka Tateisi, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Subdomain Adaptation of a POS Tagger with a Small Corpus. Proceedings of the HLT-NAACL BioNLP Workshop.
• Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL.