Prototype-Driven Learning for Sequence Models
Aria Haghighi and Dan Klein Computer Science Division University of California Berkeley
Overview Unlabeled Data Prototype List Target Label + Annotated Data
Sequence Modeling Tasks Information Extraction: Classified Ads
Labels: Size, Restrict, Terms, Location, Features
Example: 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Paid water and garbage. No dogs allowed.
Prototype List • FEATURE: kitchen, laundry • LOCATION: near, close • TERMS: paid, utilities • SIZE: large, feet • RESTRICT: cat, smoking
Sequence Modeling Tasks English POS
Tags: NN IN VBN NNS CC IN JJ NNP CD RB PUNC DET
Prototype List • NN: president • VBD: said • CC: and • NNP: Mr. • JJ: new • DET: the • IN: of • NNS: shares • TO: to • PUNC: . • CD: million • VBP: are
Generalizing Prototypes • Prototypes: NN president, DET the, VBD said • Tie each word to its most similar prototype: a ≈ the, witness ≈ president, reported ≈ said
Generalizing Prototypes y: DET NN VBD x: a witness reported • ‘reported’ → VBD • suffix2=‘ed’ → VBD • sim=‘said’ → VBD
Markov Random Fields for Unlabeled Data
Markov Random Fields y: DET NN VBD x: a witness reported • x: input sentence • y: hidden labels
Markov Random Fields y: DET NN VBD x: a witness reported • ‘a’ → DET • sim=‘the’ → DET • suffix1=‘a’ → DET
Markov Random Fields y: DET NN VBD x: a witness reported • ‘witness’ → NN • sim=‘president’ → NN • suffix2=‘ss’ → NN
Markov Random Fields y: DET NN VBD x: a witness reported • ‘reported’ → VBD • sim=‘said’ → VBD • suffix2=‘ed’ → VBD
Markov Random Fields y: DET NN VBD x: a witness reported • DET → NN → VBD
Markov Random Fields y: DET NN VBD x: a witness reported • $\mathrm{score}(x, y) = \exp\big(\theta^\top f(x, y)\big)$, where $f(x, y)$ collects the active features: ‘a’ → DET, suffix1=‘a’ → DET, ‘witness’ → NN, suffix2=‘ed’ → VBD, DET → NN → VBD, …
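A minimal sketch of the scoring function on this slide, assuming binary indicator features and a weight dictionary. The function names, the feature-string format, and the weights in `theta` are all made up for illustration; the transition features here are first-order (label pairs) to keep the sketch short.

```python
import math

def node_features(word, label, sim_prototype=None):
    """Indicator features tying one word to one label (names are illustrative)."""
    feats = [f"word={word}->{label}"]
    for k in (1, 2):  # suffixes of length 1 and 2, as in the slide examples
        if len(word) >= k:
            feats.append(f"suffix{k}={word[-k:]}->{label}")
    if sim_prototype is not None:
        feats.append(f"sim={sim_prototype}->{label}")
    return feats

def edge_features(prev_label, label):
    """Indicator feature for a label-to-label transition."""
    return [f"{prev_label}->{label}"]

def score(words, labels, sims, theta):
    """score(x, y) = exp(theta . f(x, y)) with binary features."""
    total = 0.0
    prev = None
    for word, label, sim in zip(words, labels, sims):
        for feat in node_features(word, label, sim):
            total += theta.get(feat, 0.0)
        if prev is not None:
            for feat in edge_features(prev, label):
                total += theta.get(feat, 0.0)
        prev = label
    return math.exp(total)

# Hypothetical weights; every feature not listed has weight 0.
theta = {"sim=said->VBD": 1.5, "suffix2=ed->VBD": 1.0, "DET->NN": 0.5}
x = ["a", "witness", "reported"]
sims = ["the", "president", "said"]  # most similar prototype per word
good = score(x, ["DET", "NN", "VBD"], sims, theta)
bad = score(x, ["VBD", "VBD", "DET"], sims, theta)
```

With these toy weights the labeling DET NN VBD scores higher than the scrambled one, because it activates the three weighted features.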
Markov Random Fields
• Joint Probability Model
$p_\theta(x, y) = \mathrm{score}(x, y) / Z(\theta)$
• Partition Function
$Z(\theta) = \sum_{x, y} \mathrm{score}(x, y)$ (Sum over infinite inputs!)
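Objective Function • Given unlabeled sentences $\{x_1, \ldots, x_n\}$, choose $\theta$ to maximize the marginal log-likelihood $\sum_{i=1}^{n} \log p_\theta(x_i)$, where $p_\theta(x) = \sum_{y} p_\theta(x, y)$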
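Assuming the objective is the marginal log-likelihood of the unlabeled sentences (the usual choice for a model with hidden labels, and consistent with the later "sum over hidden labels" slide), it can be expanded with the joint model and partition function defined above. The symbol $\mathcal{L}$ is introduced here only for illustration.

```latex
\begin{align*}
\mathcal{L}(\theta)
  &= \sum_{i=1}^{n} \log p_\theta(x_i)
   = \sum_{i=1}^{n} \log \sum_{y} p_\theta(x_i, y) \\
  &= \sum_{i=1}^{n} \log \sum_{y} \mathrm{score}(x_i, y) \;-\; n \log Z(\theta)
\end{align*}
```

The first term sums over label sequences for each observed sentence; the second term involves the sum over all inputs and labelings, which is why the partition function needs separate treatment.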
Optimization • Forward-Backward Algorithm • Algorithm for Lattices [lattice figure: each position ranges over the vocabulary, e.g. the / a / in / …, the / witness / in / …, the / reported / in / …]
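As a sketch of why a dynamic program helps here, the forward pass below sums the scores of all label sequences for one fixed sentence and checks the result against brute-force enumeration. It assumes first-order (label-pair) transitions and made-up potentials; it is only the forward half of forward-backward, and all names are hypothetical.

```python
import itertools
import math

def forward_sum(node_pot, edge_pot):
    """Sum of path scores over all label sequences.

    node_pot[t][k]: potential of label k at position t
    edge_pot[j][k]: potential of moving from label j to label k
    """
    alpha = list(node_pot[0])
    for t in range(1, len(node_pot)):
        alpha = [
            node_pot[t][k] * sum(alpha[j] * edge_pot[j][k] for j in range(len(alpha)))
            for k in range(len(node_pot[t]))
        ]
    return sum(alpha)

def brute_force_sum(node_pot, edge_pot):
    """Same quantity by enumerating every label sequence (exponential time)."""
    n_labels = len(node_pot[0])
    total = 0.0
    for labels in itertools.product(range(n_labels), repeat=len(node_pot)):
        s = node_pot[0][labels[0]]
        for t in range(1, len(labels)):
            s *= edge_pot[labels[t - 1]][labels[t]] * node_pot[t][labels[t]]
        total += s
    return total

# Toy potentials for a three-word sentence and three labels.
node_pot = [[2.0, 1.0, 0.5], [0.5, 3.0, 1.0], [1.0, 0.5, 4.0]]
edge_pot = [[0.2, 2.0, 0.5], [0.5, 0.3, 2.5], [1.0, 1.0, 1.0]]
agree = math.isclose(forward_sum(node_pot, edge_pot), brute_force_sum(node_pot, edge_pot))
```

The forward pass costs time linear in sentence length, whereas the enumeration grows exponentially.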
Partition Function • Length Lattice • Compute sum for fixed length • Lattice Forward-Backward [Smith & Eisner 05] • Approximation • Truncate to finite length [figure: a sum of lattices of increasing length, each position ranging over the vocabulary (the, of, In, …)]
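The truncation idea can be sketched as follows: sum the fixed-length lattice totals for lengths 1 up to a cutoff. This is a simplified illustration, assuming first-order label transitions and node scores that depend only on the current word and label; `node_score`, `edge_score`, the toy vocabulary, and the cutoff are hypothetical.

```python
import itertools
import math

def truncated_partition(node_score, edge_score, vocab, labels, max_len):
    """Approximate Z by summing over all (x, y) with 1 <= len(x) <= max_len."""
    n = len(labels)
    # Summing out the word at a position leaves one emission mass per label.
    emit = [sum(math.exp(node_score(w, k)) for w in vocab) for k in labels]
    trans = [[math.exp(edge_score(j, k)) for k in labels] for j in labels]
    alpha = emit[:]
    z = sum(alpha)  # contribution of length-1 inputs
    for _ in range(2, max_len + 1):
        alpha = [emit[k] * sum(alpha[j] * trans[j][k] for j in range(n)) for k in range(n)]
        z += sum(alpha)
    return z

def brute_force_partition(node_score, edge_score, vocab, labels, max_len):
    """Same sum by enumerating every sentence and labeling (tiny inputs only)."""
    z = 0.0
    for n in range(1, max_len + 1):
        for x in itertools.product(vocab, repeat=n):
            for y in itertools.product(labels, repeat=n):
                s = sum(node_score(w, k) for w, k in zip(x, y))
                s += sum(edge_score(j, k) for j, k in zip(y, y[1:]))
                z += math.exp(s)
    return z

def node_score(word, label):
    # Hypothetical weights: two word-label pairs are favored.
    return 1.0 if (word, label) in {("the", "DET"), ("witness", "NN")} else 0.0

def edge_score(prev_label, label):
    return 0.5 if (prev_label, label) == ("DET", "NN") else 0.0

vocab = ["the", "witness"]
labels = ["DET", "NN"]
z_fast = truncated_partition(node_score, edge_score, vocab, labels, max_len=3)
z_slow = brute_force_partition(node_score, edge_score, vocab, labels, max_len=3)
```

The lattice version never enumerates sentences, so its cost grows with the cutoff length and the vocabulary size rather than with the number of possible inputs.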
Experiments
English POS Experiments • Data • 193K tokens (about 8K sentences) of WSJ portion of Penn Treebank • Features [Smith & Eisner 05] • Trigram tagger • Word type, suffixes up to length 3, contains hyphen, contains digit, initial capitalization
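The word-level feature templates listed on this slide could be sketched as below. The function name and the feature-string format are illustrative choices, not taken from the talk.

```python
def word_features(word, max_suffix=3):
    """Word type, suffixes up to max_suffix, hyphen, digit, initial capital."""
    feats = [f"word={word}"]
    for k in range(1, min(max_suffix, len(word)) + 1):
        feats.append(f"suffix{k}={word[-k:]}")
    if "-" in word:
        feats.append("has_hyphen")
    if any(ch.isdigit() for ch in word):
        feats.append("has_digit")
    if word[:1].isupper():
        feats.append("init_cap")
    return feats

reported = word_features("reported")
mixed = word_features("Mid-1990s")
```

In a tagger, each of these would be conjoined with a candidate label to form a node feature.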
English POS Experiments • Fully Unsupervised • Random initialization • Greedy label remapping
Accuracy: BASE 41.3
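The slide does not spell out the remapping procedure. One plausible reading, sketched here as an assumption, is a one-to-one greedy assignment: repeatedly pair the induced cluster and gold tag that co-occur most often, then score accuracy under that mapping. All names and the toy data are hypothetical.

```python
from collections import Counter

def greedy_remap(pred, gold):
    """Greedily map each induced cluster to a distinct gold tag."""
    counts = Counter(zip(pred, gold))
    mapping, used = {}, set()
    # Visit (cluster, tag) pairs from most to least frequent co-occurrence;
    # ties are broken by the string form of the pair to stay deterministic.
    for (cluster, tag), _ in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        if cluster not in mapping and tag not in used:
            mapping[cluster] = tag
            used.add(tag)
    return mapping

def accuracy(pred, gold, mapping):
    """Fraction of tokens whose remapped cluster equals the gold tag."""
    hits = sum(mapping.get(p) == g for p, g in zip(pred, gold))
    return hits / len(gold)

pred = [0, 1, 2, 0, 1, 2, 0, 1]
gold = ["DET", "NN", "VBD", "DET", "NN", "VBD", "NN", "NN"]
mapping = greedy_remap(pred, gold)
acc = accuracy(pred, gold, mapping)
```

A many-to-one mapping would be another reasonable reading and would generally give higher scores.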
English POS Experiments • Prototype List • 3 prototypes per tag • Automatically extracted by frequency
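Extraction by frequency might look like the sketch below, which assumes access to some tagged tokens and simply keeps the most frequent words per tag. The function name and the toy data are hypothetical, and the talk does not say how words seen with several tags are handled.

```python
from collections import Counter, defaultdict

def extract_prototypes(tagged_tokens, per_tag=3):
    """Return the per_tag most frequent words observed with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[tag][word] += 1
    return {tag: [w for w, _ in c.most_common(per_tag)] for tag, c in counts.items()}

# Toy tagged data with distinct counts so the ranking is unambiguous.
tagged = (
    [("the", "DET")] * 4 + [("a", "DET")] * 3 + [("an", "DET")] * 2 + [("this", "DET")]
    + [("president", "NN")] * 2 + [("witness", "NN")]
    + [("said", "VBD")]
)
prototypes = extract_prototypes(tagged)
```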
English POS Distributional Similarity • Judge a word by the company it keeps • Context positions -1 and +1, e.g. "the president said", "a downturn is near" • Collect context counts from 40M words of WSJ • president: the ___ said: 0.6, a ___ reported: 0.3
• Similarity [Schuetze 93] • SVD dimensionality reduction • Cosine similarity measure
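A simplified sketch of the context-count idea: collect (left word, right word) contexts for each word and compare words by the cosine of their count vectors. The SVD dimensionality-reduction step is deliberately omitted, and the sentence-boundary padding symbols and toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Count the (previous word, next word) context of every token."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            vecs[padded[i]][(padded[i - 1], padded[i + 1])] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

sentences = [
    ["the", "president", "said"],
    ["the", "witness", "said"],
    ["a", "witness", "reported"],
    ["a", "downturn", "is", "near"],
]
vecs = context_vectors(sentences)
sim_witness = cosine(vecs["president"], vecs["witness"])
sim_downturn = cosine(vecs["president"], vecs["downturn"])
```

Here "president" and "witness" share the context "the ___ said", so they come out more similar than "president" and "downturn", which share none.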
English POS Experiments • Add similarity features • Top five most similar prototypes that exceed threshold
Accuracy: BASE 41.3 • PROTO+SIM 80.5 (67.8% accuracy on non-prototypes)
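Selecting the similarity features for a word can be sketched as a filter followed by a top-k cut. The threshold value, the `sim=` feature naming, and the toy similarity table are hypothetical; only the "top five above a threshold" rule comes from the slide.

```python
def similar_prototypes(word, prototypes, similarity, top_k=5, threshold=0.5):
    """Return sim= features for the top_k prototypes scoring above threshold."""
    kept = [(similarity(word, p), p) for p in prototypes]
    kept = [(s, p) for s, p in kept if s > threshold]
    # Highest similarity first; ties are broken alphabetically.
    kept.sort(key=lambda sp: (-sp[0], sp[1]))
    return [f"sim={p}" for _, p in kept[:top_k]]

# Made-up similarity scores for illustration.
toy_sim = {
    ("reported", "said"): 0.9,
    ("reported", "are"): 0.6,
    ("reported", "president"): 0.2,
}

def similarity(word, prototype):
    return toy_sim.get((word, prototype), 0.0)

prototypes = ["president", "said", "are", "the"]
feats = similar_prototypes("reported", prototypes, similarity)
```

Words with no prototype above the threshold simply receive no similarity features.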
English POS Transition Counts Target Structure Learned Structure
Classified Ads Experiments • Data • 100 ads (about 119K tokens) from [Grenager et al. 05] • Features • Trigram tagger • Word type
Classified Ads Experiments • Fully Unsupervised • Random initialization • Greedy label remapping
Accuracy: BASE 46.4
Classified Ads Experiments • Prototype List • 3 prototypes per tag • 33 words in total • Automatically extracted by frequency
Classified Ads Distributional Similarity • Different from English POS, which uses context positions -1 and +1 (e.g. "the president said", "a downturn is near") • Similar to topic model, e.g. "walking distance to shopping, public transportation"
Classified Ads Experiments • Add similarity features
Accuracy: BASE 46.4 • PROTO+SIM 71.5
Reacting to observed errors • Boundary Model [figure: boundary between the Location and Terms fields, at the token "Paid"] • Augment Prototype List • Boundary: , ; .
Classified Ads Experiments • Add Boundary field
Accuracy: BASE 46.4 • PROTO+SIM 71.5 • BOUND 74.1
Information Extraction Transition Counts Target Structure Learned Structure
Conclusion • Prototype-Driven learning • Novel flexible weakly-supervised learning framework • Merged distributional clustering techniques with supervised structured models
Thanks!
Questions?
English POS Experiments • Fix Prototypes to their tag • No random initialization • No remapping
Accuracy: BASE 41.3 • PROTO 68.8 (47.7% accuracy on non-prototypes)
Classified Ads Experiments • Fix Prototypes to their tag • No random initialization • No remapping
Accuracy: BASE 46.4 • PROTO 53.7
Objective Function • Sum over hidden labels • Forward-Backward Algorithm [figure: label lattice over "a witness reported"]
Objective Function • Infinite sum over all lengths of input • Can be computed exactly under certain conditions [figure: sum of lattices of increasing length]
English POS Distributional Similarity • Collect context counts from the BLLIP corpus
president: the ___ said: 0.6, a ___ reported: 0.3
downturn: a ___ was: 0.8, a ___ is: 0.2
a: ___ witness: 0.6, said ___ downturn: 0.3
the: ___ president: 0.8, said ___ downturn: 0.2
• Similarity [Schuetze 93] • SVD dimensionality reduction • Cosine between context vectors