
Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science /
Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu
Nutshell Version

unannotated text → contrastive estimation with lattice neighborhoods → tractable training

Experiments on unlabeled data, with "max ent" features and sequence models:
• POS tagging: 46% error rate reduction (relative to EM)
• "max ent" features make it possible to survive damage to the tag dictionary
• dependency parsing: 21% attachment error reduction (relative to EM)
“Red leaves don’t hide blue jays.”
Maximum Likelihood Estimation (Supervised)

[Figure: the model p puts probability mass on the observed tagged sentence
(x = red leaves don't hide blue jays, y = JJ NNS MD VB JJ NNS)
at the expense of every other pair in Σ* × Λ*.]
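In symbols (writing u(x, y | θ) for the model's unnormalized score, as in the accompanying paper), the slide depicts the objective

    \max_{\vec\theta} \prod_i \frac{u(x_i, y_i \mid \vec\theta)}
                                   {\sum_{(x,y) \in \Sigma^* \times \Lambda^*} u(x, y \mid \vec\theta)}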
Maximum Likelihood Estimation (Unsupervised)

[Figure: only the words (red leaves don't hide blue jays) are observed; the six tags are hidden.
The numerator is the observed sentence under all of its taggings; the denominator is still
Σ* × Λ*. This is what EM does.]
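In the same notation, the marginal likelihood that EM locally maximizes sums the hidden taggings into the numerator:

    \max_{\vec\theta} \prod_i \frac{\sum_{y \in \Lambda^*} u(x_i, y \mid \vec\theta)}
                                   {\sum_{(x,y) \in \Sigma^* \times \Lambda^*} u(x, y \mid \vec\theta)}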
Focusing Probability Mass

[Figure, repeated on two slides: training shifts probability mass toward the numerator and away
from the rest of the denominator.]
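Schematically, every estimator in this talk maximizes a ratio of this shape; only the two sets change:

    \max_{\vec\theta} \prod_i \frac{\sum_{(x,y) \in \mathrm{numerator}_i} u(x, y \mid \vec\theta)}
                                   {\sum_{(x,y) \in \mathrm{denominator}_i} u(x, y \mid \vec\theta)}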
Conditional Estimation (Supervised)

[Figure: the numerator is the observed tagged sentence (x = red leaves don't hide blue jays,
y = JJ NNS MD VB JJ NNS); the denominator is (x) × Λ*, the observed words under every possible
tagging. A different denominator!]
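With the observed words fixed in the denominator, the ratio is exactly the conditional likelihood:

    \max_{\vec\theta} \prod_i \frac{u(x_i, y_i \mid \vec\theta)}
                                   {\sum_{y \in \Lambda^*} u(x_i, y \mid \vec\theta)}
    \;=\; \max_{\vec\theta} \prod_i p_{\vec\theta}(y_i \mid x_i)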
Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words

*For generative models.
Objective Functions

Objective                 | Optimization Algorithm     | Numerator (observed data) | Denominator
MLE                       | Count & Normalize*         | tags & words              | Σ* × Λ*
MLE with hidden variables | EM*                        | words                     | Σ* × Λ*
Contrastive Estimation    | generic numerical solvers† | words‡                    | ?
Conditional Likelihood    | Iterative Scaling          | tags & words              | (words) × Λ*
Perceptron                | Backprop                   | tags & words              | hypothesized tags & words

*For generative models.
†In this talk, LMVM L-BFGS.
‡In this talk, the raw word sequence, summed over all possible taggings.
This talk is about denominators ...
in the unsupervised case.
A good denominator can improve
accuracy
and
tractability.
Language Learning (Syntax)

red leaves don't hide blue jays

EM asks: Why didn't he say "birds fly" or "dancing granola" or "the wash dishes" or any other
sequence of words?
Language Learning (Syntax)

red leaves don't hide blue jays

Why did he pick that sequence for those words? Why not say "leaves red ..." or
"... hide don't ..." or ...?
What is a syntax model supposed to explain? Each learning hypothesis corresponds to a
denominator / neighborhood.
The Job of Syntax

"Explain why each word is necessary." → DEL1WORD neighborhood

red don't hide blue jays
leaves don't hide blue jays
red leaves hide blue jays
red leaves don't hide blue jays   ← the observed sentence
red leaves don't hide blue
red leaves don't blue jays
red leaves don't hide jays
The Job of Syntax

"Explain the (local) order of the words." → TRANS1 neighborhood

red don't leaves hide blue jays
leaves red don't hide blue jays
red leaves don't hide blue jays   ← the observed sentence
red leaves hide don't blue jays
red leaves don't hide jays blue
red leaves don't blue hide jays

(A code sketch of both neighborhoods follows.)
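To make the two neighborhoods concrete, here is a brute-force Python sketch (illustrative code,
not the talk's implementation, which represents each neighborhood compactly as a lattice):

    def del1word(words):
        """DEL1WORD: the sentence plus every result of deleting one word (n+1 strings)."""
        yield tuple(words)
        for i in range(len(words)):
            yield tuple(words[:i]) + tuple(words[i + 1:])

    def trans1(words):
        """TRANS1: the sentence plus every transposition of an adjacent pair (n strings)."""
        yield tuple(words)
        for i in range(len(words) - 1):
            swapped = list(words)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            yield tuple(swapped)

    sentence = "red leaves don't hide blue jays".split()
    for neighbor in trans1(sentence):
        print(" ".join(neighbor))

The counts match the neighborhood-size table later in the talk: del1word yields n+1 strings and
trans1 yields n.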
[Figure, over several slides: the numerator is the observed sentence, red leaves don't hide blue
jays, with any tagging; the denominator sums over all sentences in its TRANS1 neighborhood, each
also with any tagging.]
The New Modeling Imperative

A good sentence hints that a set of bad ones is nearby.

numerator: the good sentence
denominator ("neighborhood"): the good sentence plus those bad neighbors

"Make the good sentence likely, at the expense of those bad neighbors."
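Written out, contrastive estimation with a neighborhood function N(·) maximizes

    \max_{\vec\theta} \prod_i \frac{\sum_{y \in \Lambda^*} u(x_i, y \mid \vec\theta)}
                                   {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u(x', y \mid \vec\theta)}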
This talk is about denominators ...
in the unsupervised case.
A good denominator can improve
accuracy
and
tractability.
Log-Linear Models

p(x, y | θ) = u(x, y | θ) / Z(θ), where u(x, y | θ) is the score of (x, y) and Z(θ) is the
partition function.

Computing Z is undesirable: it sums over all possible taggings of all possible sentences!

Conditional Estimation (Supervised): replace Z with a sum over 1 sentence.
Contrastive Estimation (Unsupervised): replace Z with a sum over a few sentences.
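As a toy illustration (my own code, with a deliberately tiny feature set and brute-force inner
sums; the real system replaces the enumerations with lattice dynamic programming):

    import math
    from itertools import product

    def u(words, tags, weights):
        """Unnormalized score: exp of summed tag-bigram and tag/word feature weights."""
        total, prev = 0.0, "<s>"
        for w, t in zip(words, tags):
            total += weights.get(("bigram", prev, t), 0.0)
            total += weights.get(("tag/word", t, w), 0.0)
            prev = t
        return math.exp(total)

    def log_ce(words, neighborhood, tagset, weights):
        """log of [sum over taggings of x] / [sum over neighbors x' and their taggings]."""
        def marginal(ws):
            return sum(u(ws, tags, weights) for tags in product(tagset, repeat=len(ws)))
        return math.log(marginal(words)) - math.log(sum(marginal(n) for n in neighborhood))

With the trans1() generator above and all weights zero, log_ce returns log(1/n) for a length-n
sentence, since every neighbor then gets equal mass.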
A Big Picture: Sequence Model Estimation

[Diagram relating estimators to three desiderata: unannotated data, tractable sums, and
overlapping features. Generative MLE of p(x, y) and log-linear MLE of p(x, y) need annotated
data; generative EM for p(x) gets unannotated data with tractable sums; log-linear EM for p(x)
gets unannotated data and overlapping features but not tractable sums; log-linear conditional
estimation of p(y | x) gets overlapping features and tractable sums but needs annotation.
Log-linear CE with lattice neighborhoods gets all three.]
Contrastive Neighborhoods

• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.

There is an art to choosing neighborhood functions.
Neighborhoods

neighborhood    | size  | lattice arcs | perturbations
DEL1WORD        | n+1   | O(n)         | delete up to 1 word
TRANS1          | n     | O(n)         | transpose any bigram
DELORTRANS1     | O(n)  | O(n)         | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²)        | delete any contiguous subsequence
Σ* (EM)         | ∞     | -            | replace each word with anything
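Why the lattice matters: every neighbor is a path, but the sum over all of them costs only
O(arcs). A minimal sketch (my code, not the talk's) for DEL1WORD, whose lattice has states
(position, deleted-yet?):

    def del1word_path_sum(words, word_weight):
        """Sum over all n+1 strings in DEL1WORD of the product of per-word weights,
        via forward dynamic programming over O(n) lattice arcs."""
        forward = {False: 1.0, True: 0.0}  # paths that haven't / have deleted a word yet
        for w in words:
            forward = {
                False: forward[False] * word_weight(w),                 # keep w
                True: forward[True] * word_weight(w) + forward[False],  # keep w, or delete it here
            }
        return forward[False] + forward[True]

With word_weight(w) = 1 this returns n + 1, the neighborhood size from the table. In the actual
system the neighborhood lattice is intersected with the tagging model, so the same style of
dynamic program also sums over taggings.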
The Merialdo (1994) Task

Given unlabeled text and a POS dictionary (which lists all possible tags for each word type),
learn to tag. The dictionary is a form of supervision.
Trigram Tagging Model

JJ    NNS     MD     VB    JJ    NNS
red   leaves  don't  hide  blue  jays

feature set:
tag trigrams
tag/word pairs from a POS dictionary
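One way to realize this feature set in code (an illustrative sketch; the feature names are mine):

    from collections import Counter

    def tagging_features(words, tags):
        """Counts of tag-trigram and tag/word features for one tagged sentence."""
        feats = Counter()
        padded = ("<s>", "<s>") + tuple(tags) + ("</s>",)
        for i in range(len(padded) - 2):
            feats[("trigram", padded[i], padded[i + 1], padded[i + 2])] += 1
        for w, t in zip(words, tags):
            feats[("tag/word", t, w)] += 1
        return feats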
[Bar chart: tagging accuracy on ambiguous words. Supervised CRF 99.5; supervised HMM 97.2.
Contrastive neighborhoods: LENGTH 79.3, TRANS1 79.0, DELORTRANS1 78.8, DEL1WORD 60.4,
DEL1SUBSEQUENCE 58.7. Baselines: DA with 10× data (Smith & Eisner, 2004) 70.0; EM
(Merialdo, 1994) 66.6; EM 62.1; log-linear EM ≈ random, 35.1. Conditions: 96K words, full POS
dictionary, uninformative initializer, best of 8 smoothing conditions.]
What if we damage the POS dictionary?

Dictionary includes ...
■ all words
■ words from 1st half of corpus
■ words with count ≥ 2
■ words with count ≥ 3

The dictionary excludes OOV words, which can get any tag.
[Bar chart: tagging accuracy (all words) under the four dictionary conditions above. 96K words,
17 coarse POS tags, uninformative initializer. With the full dictionary the best systems reach
about 90.4; as the dictionary is damaged, accuracies fall as low as 51.0.]
Trigram Tagging Model + Spelling

JJ    NNS     MD     VB    JJ    NNS
red   leaves  don't  hide  blue  jays

feature set:
tag trigrams
tag/word pairs from a POS dictionary
1- to 3-character suffixes; contains hyphen; contains digit
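The added spelling features might be extracted roughly like this (again a sketch paraphrasing the
slide's templates):

    def spelling_features(word, tag):
        """Pair the tag with surface properties of the word."""
        feats = [("suffix", tag, word[-k:]) for k in (1, 2, 3) if len(word) >= k]
        if "-" in word:
            feats.append(("contains-hyphen", tag))
        if any(ch.isdigit() for ch in word):
            feats.append(("contains-digit", tag))
        return feats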
Spelling features aided recovery, but only with a smart neighborhood.

[Bar chart: the same accuracy plot with spelling features added; the new values include 91.9,
91.1, 90.8, 90.3, 89.8, 83.2, 73.8, and 73.6, with the best neighborhoods staying near 90 even
under damaged dictionaries.]
The model need not be finite-state.
Unsupervised Dependency Parsing

[Bar chart: attachment accuracy with the Klein & Manning (2004) model, for EM, LENGTH, and
TRANS1 under an uninformative initializer and a clever initializer; the plotted values are 23.6,
33.8, 35.2, 37.4, 42.1, and 48.7.]

See our paper at the IJCAI 2005 Grammatical Inference workshop.
To Sum Up ...

Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or,
as in our case, for both).

Now we can use the task to guide the unsupervised learner (as discriminative techniques do for
supervised learners).

It's a particularly good fit for log-linear models: unsupervised sequence models with max ent
features, all in time for ACL 2006.