
Prototype-Driven Learning for Sequence Models

Aria Haghighi and Dan Klein, Computer Science Division, University of California, Berkeley

Overview [Diagram labels: Unlabeled Data, Prototype List, Target Label, Annotated Data]

Sequence Modeling Tasks • Information Extraction: Classified Ads • Fields: Size, Restrict, Terms, Location, Features • Example ad: 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop. Paid water and garbage. No dogs allowed.

Prototype List • FEATURE: kitchen, laundry • LOCATION: near, close • TERMS: paid, utilities • SIZE: large, feet • RESTRICT: cat, smoking

Sequence Modeling Tasks • English POS • Tags: NN IN VBN NNS CC IN JJ NNP CD RB PUNC DET

Prototype List • NN: president • VBD: said • CC: and • NNP: Mr. • JJ: new • DET: the • IN: of • NNS: shares • TO: to • PUNC: . • CD: million • VBP: are

Generalizing Prototypes • Prototypes: NN president, DET the, VBD said • a ≈ the, witness ≈ president, reported ≈ said • Tie each word to its most similar prototype

Generalizing Prototypes • x: a witness reported • y: DET NN VBD • Features on ‘reported’: ‘reported’ → VBD, suffix2=‘ed’ → VBD, sim=‘said’ → VBD
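The feature templates on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `node_features`, the feature-string format, and the `similar` table are all assumptions made for the example.

```python
def node_features(word, similar_prototypes, max_suffix=3):
    """Return feature strings for one word; each is later paired with a candidate label."""
    feats = [f"word={word}"]
    # Suffix features, e.g. suffix2=ed for 'reported'.
    for n in range(1, min(max_suffix, len(word)) + 1):
        feats.append(f"suffix{n}={word[-n:]}")
    # Similarity features: one per prototype the word has been tied to.
    for proto in similar_prototypes.get(word, []):
        feats.append(f"sim={proto}")
    return feats


# Hypothetical similarity table mirroring the slide's example.
similar = {"a": ["the"], "witness": ["president"], "reported": ["said"]}
features = node_features("reported", similar)
print(features)
```

Because ‘said’ is a prototype for VBD, a positive weight on (sim=said, VBD) lets the prototype's label propagate to ‘reported’ even though ‘reported’ is not itself in the prototype list.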

Markov Random Fields for Unlabeled Data

Markov Random Fields • x: input sentence (a witness reported) • y: hidden labels (DET NN VBD)

Markov Random Fields • x: a witness reported • y: DET NN VBD • Features on ‘a’: ‘a’ → DET, sim=‘the’ → DET, suffix1=‘a’ → DET

Markov Random Fields • x: a witness reported • y: DET NN VBD • Features on ‘witness’: ‘witness’ → NN, sim=‘president’ → NN, suffix2=‘ss’ → NN

Markov Random Fields • x: a witness reported • y: DET NN VBD • Features on ‘reported’: ‘reported’ → VBD, sim=‘said’ → VBD, suffix2=‘ed’ → VBD

Markov Random Fields • x: a witness reported • y: DET NN VBD • Transition features: DET → NN → VBD

Markov Random Fields • x: a witness reported • y: DET NN VBD • $\mathrm{score}_\theta(x, y) = \exp(\theta^{T} f(x, y))$, where $f(x, y)$ is the vector of active features: ‘a’ → DET, suffix1=‘a’ → DET, ‘witness’ → NN, suffix2=‘ed’ → VBD, DET → NN → VBD, …
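As a rough illustration of the score on this slide, the sketch below sums the weights of the active word/label and label/label features and exponentiates the total. The weight dictionary `theta`, the feature naming, and the first-order (bigram) transitions are assumptions made for the example; the experiments later in the talk use a trigram tagger.

```python
import math


def score(words, labels, theta, feats_fn):
    """Unnormalized score exp(theta . f(x, y)) for one labeled sentence."""
    total = 0.0
    prev = "<s>"  # assumed start-of-sentence symbol
    for word, label in zip(words, labels):
        # Node features: each word feature paired with the word's label.
        for feat in feats_fn(word):
            total += theta.get((feat, label), 0.0)
        # Edge feature: previous label followed by the current label.
        total += theta.get(("trans", prev, label), 0.0)
        prev = label
    return math.exp(total)


def simple_feats(word):
    return [f"word={word}", f"suffix2={word[-2:]}"]


# Hypothetical weights; unlisted features have weight 0.
theta = {
    ("word=a", "DET"): 1.0,
    ("suffix2=ed", "VBD"): 0.5,
    ("trans", "DET", "NN"): 0.25,
}
s = score(["a", "witness", "reported"], ["DET", "NN", "VBD"], theta, simple_feats)
print(math.log(s))  # total weight of the active features
```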

Markov Random Fields • Joint Probability Model: $p_\theta(x, y) = \mathrm{score}_\theta(x, y) / Z(\theta)$ • Partition Function: $Z(\theta) = \sum_{x, y} \mathrm{score}_\theta(x, y)$ • Sum over infinite inputs!

Objective Function • Given unlabeled sentences $\{x_1, \ldots, x_n\}$, choose $\theta$ to maximize the marginal likelihood of the sentences, summing over the hidden labels $y$
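The slide's own formula did not survive extraction. Assuming the objective is the usual marginal log-likelihood under the joint model $p_\theta(x, y) = \mathrm{score}_\theta(x, y) / Z(\theta)$ defined on the earlier slide, it can be written as:

```latex
\mathcal{L}(\theta)
  = \sum_{i=1}^{n} \log \sum_{y} p_\theta(x_i, y)
  = \sum_{i=1}^{n} \log \sum_{y} \mathrm{score}_\theta(x_i, y) \;-\; n \log Z(\theta)
```

The first term sums over hidden labels for each observed sentence; the second term is where the partition function over all possible inputs enters.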

Optimization • Forward-Backward Algorithm • Algorithm for Lattices [Figure: lattice with one column of candidate words per position: (the, a, in, …), (the, witness, in, …), (the, reported, in, …)]
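The forward half of forward-backward can be sketched as follows for a fixed sentence: it sums the scores of every label sequence without enumerating them. This is a first-order simplification with made-up weights (`node`, `edge`, and the label set are assumptions for the example), and the brute-force function is included only to check the recursion.

```python
import math
from itertools import product


def forward_sum(words, labels, node, edge):
    """Sum of exp(total weight) over all label sequences for one sentence."""
    alpha = {y: math.exp(node(words[0], y)) for y in labels}
    for word in words[1:]:
        # New alpha[y]: mass of all label prefixes that end in label y.
        alpha = {
            y: math.exp(node(word, y))
            * sum(alpha[p] * math.exp(edge(p, y)) for p in labels)
            for y in labels
        }
    return sum(alpha.values())


LABELS = ["DET", "NN", "VBD"]
GOOD_NODES = {("a", "DET"), ("witness", "NN"), ("reported", "VBD")}
GOOD_EDGES = {("DET", "NN"), ("NN", "VBD")}


def node(word, label):
    return 1.0 if (word, label) in GOOD_NODES else 0.0


def edge(prev, label):
    return 0.5 if (prev, label) in GOOD_EDGES else 0.0


def brute_force(words, labels):
    """Same quantity by explicit enumeration; exponential in sentence length."""
    z = 0.0
    for ys in product(labels, repeat=len(words)):
        w = sum(node(x, y) for x, y in zip(words, ys))
        w += sum(edge(p, y) for p, y in zip(ys, ys[1:]))
        z += math.exp(w)
    return z


words = ["a", "witness", "reported"]
total = forward_sum(words, LABELS, node, edge)
print(total, brute_force(words, LABELS))
```

The recursion costs time linear in sentence length and quadratic in the number of labels, which is what makes the sum over hidden labels tractable.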

Partition Function • Length Lattice • Compute sum for fixed length • Lattice Forward-Backward [Smith & Eisner 05] • Approximation • Truncate to finite length

[Figure: lattices of unknown labels (?) over the vocabulary (the, of, in, …), one lattice per sentence length, added together]
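The truncation idea can be sketched as below: for each length, a forward pass sums over every word and every label at each position, and the per-length totals are added up to a cutoff. This is a first-order simplification under assumed names (`truncated_partition`, `node`, `edge`, `max_len`), not the authors' implementation.

```python
import math


def truncated_partition(vocab, labels, node, edge, max_len):
    """Approximate Z(theta) as the sum of Z_len for len = 1..max_len."""
    # Emission mass of each label, summed over every word in the vocabulary.
    emit = {y: sum(math.exp(node(w, y)) for w in vocab) for y in labels}
    alpha = dict(emit)  # forward values for length-1 inputs
    z = sum(alpha.values())
    for _ in range(2, max_len + 1):
        # Extend every length-(k-1) lattice by one more position.
        alpha = {
            y: emit[y] * sum(alpha[p] * math.exp(edge(p, y)) for p in labels)
            for y in labels
        }
        z += sum(alpha.values())
    return z


# With all weights zero, every (x, y) pair has score 1, so Z_len = (|V| * |labels|) ** len.
z = truncated_partition(["the", "of"], ["DET", "IN"],
                        lambda w, y: 0.0, lambda p, y: 0.0, max_len=2)
print(z)  # 4 + 16 = 20.0
```

Summing the emission mass over the vocabulary once per label keeps each step independent of vocabulary size after the initial pass.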

Experiments

English POS Experiments • Data • 193K tokens (about 8K sentences) of WSJ portion of Penn Treebank • Features [Smith & Eisner 05] • Trigram tagger • Word type, suffixes up to length 3, contains hyphen, contains digit, initial capitalization

English POS Experiments • Fully Unsupervised • Random initialization • Greedy label remapping

[Bar chart, accuracy (%): BASE 41.3]
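The slide does not spell out how the greedy label remapping works. One plausible reading, shown here purely as an assumption, is a many-to-one mapping in which each induced label is assigned the gold tag it co-occurs with most often before accuracy is computed.

```python
from collections import Counter


def greedy_remap(predicted, gold):
    """Map each induced label to the gold tag it most often co-occurs with."""
    counts = Counter(zip(predicted, gold))
    mapping = {}
    # most_common() yields pairs by descending count, so the first pair seen
    # for an induced label is its most frequent gold tag.
    for (pred, tag), _ in counts.most_common():
        mapping.setdefault(pred, tag)
    return mapping


# Toy data: induced labels are arbitrary integers.
predicted = [0, 0, 1, 1, 1, 2]
gold = ["DET", "DET", "NN", "NN", "VBD", "VBD"]
mapping = greedy_remap(predicted, gold)
accuracy = sum(mapping[p] == g for p, g in zip(predicted, gold)) / len(gold)
print(mapping, accuracy)
```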

English POS Experiments • Prototype List • 3 prototypes per tag • Automatically extracted by frequency
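Extraction by frequency could look like the sketch below, which takes the most frequent word types for each tag from a labeled sample. The function name and toy data are assumptions for the example, and any additional filtering the authors applied is not shown on the slide.

```python
from collections import Counter, defaultdict


def extract_prototypes(tagged_tokens, per_tag=3):
    """Pick the per_tag most frequent word types for each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[tag][word] += 1
    return {tag: [w for w, _ in c.most_common(per_tag)]
            for tag, c in counts.items()}


# Toy labeled sample (hypothetical data).
sample = [
    ("the", "DET"), ("president", "NN"), ("said", "VBD"),
    ("the", "DET"), ("shares", "NNS"), ("a", "DET"),
    ("president", "NN"), ("witness", "NN"), ("said", "VBD"),
]
prototypes = extract_prototypes(sample)
print(prototypes)
```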

English POS Distributional Similarity • Judge a word by the company it keeps • Example: "the president said a downturn is near" (contexts at positions -1 and +1) • Collect context counts from 40M words of WSJ • president: the ___ said: 0.6, a ___ reported: 0.3

• Similarity [Schuetze 93] • SVD dimensionality reduction • Cosine similarity measure
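A minimal sketch of the context-count and cosine steps follows. It keys each context by its offset (-1 or +1) and neighboring word, which is an assumption about the representation, and it omits the SVD dimensionality reduction to stay dependency-free; the toy sentences are made up.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Count the words at offsets -1 and +1 around every token."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]  # assumed boundary symbols
        for i in range(1, len(padded) - 1):
            vecs[padded[i]][(-1, padded[i - 1])] += 1
            vecs[padded[i]][(+1, padded[i + 1])] += 1
    return vecs


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


sentences = [
    ["the", "president", "said"],
    ["the", "witness", "said"],
    ["a", "downturn", "is", "near"],
]
vecs = context_vectors(sentences)
sim_pw = cosine(vecs["president"], vecs["witness"])
sim_pd = cosine(vecs["president"], vecs["downturn"])
print(sim_pw, sim_pd)
```

In this toy corpus ‘president’ and ‘witness’ share both contexts and come out maximally similar, while ‘president’ and ‘downturn’ share none.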

English POS Experiments • Add similarity features • Top five most similar prototypes that exceed threshold • 67.8% accuracy on non-prototype words

[Bar chart, accuracy (%): BASE 41.3, PROTO+SIM 80.5]
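Selecting the similarity features for a word might look like the sketch below: score every prototype, drop those at or below a threshold, and keep the top five. The threshold value of 0.35 and the toy similarity table are hypothetical; the slide does not state the threshold used.

```python
def similar_prototypes(word, prototypes, similarity, k=5, threshold=0.35):
    """Return up to k prototypes whose similarity to word exceeds threshold."""
    scored = [(similarity(word, p), p) for p in prototypes if p != word]
    kept = [(s, p) for s, p in scored if s > threshold]
    kept.sort(key=lambda sp: sp[0], reverse=True)  # most similar first
    return [p for _, p in kept[:k]]


# Hypothetical similarity scores for illustration only.
sims = {
    ("witness", "president"): 0.8,
    ("witness", "the"): 0.4,
    ("witness", "said"): 0.1,
}


def toy_similarity(a, b):
    return sims.get((a, b), 0.0)


chosen = similar_prototypes("witness", ["the", "president", "said"], toy_similarity)
print(chosen)
```

Each returned prototype would then become one sim=prototype feature on the word.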

English POS Transition Counts [Figure: two panels, Target Structure and Learned Structure]

Classified Ads Experiments • Data • 100 ads (about 119K tokens) from [Grenager et al. 05] • Features • Trigram tagger • Word type

Classified Ads Experiments • Fully Unsupervised • Random initialization • Greedy label remapping

[Bar chart, accuracy (%): BASE 46.4]

Classified Ads Experiments • Prototype List • 3 prototypes per tag • 33 words in total • Automatically extracted by frequency

Classified Ads Distributional Similarity • Different from English POS, which uses contexts at positions -1 and +1 ("the president said a downturn is near") • Similar to topic model ("walking distance to shopping, public transportation")

Classified Ads Experiments • Add similarity features

[Bar chart, accuracy (%): BASE 46.4, PROTO+SIM 71.5]

Reacting to observed errors • Boundary Model [Figure: Location and Terms fields separated by boundary tokens (. . .) before "Paid"] • Augment Prototype List • Boundary: , ; .

Classified Ads Experiments • Add Boundary field

[Bar chart, accuracy (%): BASE 46.4, PROTO+SIM 71.5, BOUND 74.1]

Information Extraction Transition Counts [Figure: two panels, Target Structure and Learned Structure]

Conclusion • Prototype-Driven learning • Novel flexible weakly-supervised learning framework • Merged distributional clustering techniques with supervised structured models

Thanks!

Questions?

English POS Experiments • Fix prototypes to their tag • No random initialization • No remapping • 47.7% accuracy on non-prototype words

[Bar chart, accuracy (%): BASE 41.3, PROTO 68.8]

Classified Ads Experiments • Fix prototypes to their tag • No random initialization • No remapping

[Bar chart, accuracy (%): BASE 46.4, PROTO 53.7]

Objective Function • Sum over hidden labels • Forward-Backward Algorithm [Figure: lattice of unknown labels (?) over "a witness reported"]

Objective Function • Infinite sum over all lengths of input • Can be computed exactly under certain conditions [Figure: lattices of unknown labels (?) of increasing length, added together]

English POS Distributional Similarity • Collect context counts from BLIPP corpus • president: the ___ said: 0.6, a ___ reported: 0.3 • downturn: a ___ was: 0.8, a ___ is: 0.2 • a: ___ witness: 0.6, said ___ downturn: 0.3 • the: ___ president: 0.8, said ___ downturn: 0.2

• Similarity [Schuetze 93] • SVD dimensionality reduction • Cosine between context vectors