
Generative and Discriminative
Models in NLP: A Survey
Kristina Toutanova
Computer Science Department
Stanford University
Motivation

Many problems in natural language processing
are disambiguation problems

word senses
jaguar – a big cat, a car, name of a Java package
line - phone, queue, in mathematics, air line, etc.
part-of-speech tags (noun, verb, proper noun, etc.)
[Figure: the sentence "Joy makes progress every day ." with candidate tags for each word: Joy NNP/NN, makes VBZ/NNS, progress NN/VB, every DT, day NN]
Motivation

Parsing – choosing preferred phrase structure trees
for sentences, corresponding to likely semantics
[Figure: phrase-structure parse tree for "I saw Mary with the telescope" (S -> NP VP; VP -> VBD NP PP), illustrating the ambiguous PP attachment]
Possible approaches to disambiguation


Encode knowledge about the problem, define rules,
hand-engineer grammars and patterns (requires much
effort, not always possible to have categorical answers)
Treat the problem as a classification task and learn
classifiers from labeled training data
Overview





General ML perspective
Examples
The case of Part-of-Speech Tagging
The case of Syntactic Parsing
Conclusions
The Classification Problem
Given a training set of iid samples T = {(X1,Y1), ..., (Xn,Yn)} of input and class variables from an unknown distribution D(X,Y), estimate a function $\hat{h}(X)$ that predicts the class from the input variables.

The goal is to come up with a hypothesis $\hat{h}(X)$ with minimum expected loss (usually 0-1 loss):
$$\mathrm{err}(\hat{h}) = \sum_{\langle X, Y \rangle} D(X, Y)\,\mathbf{1}[Y \neq \hat{h}(X)]$$
Under 0-1 loss, the hypothesis with minimum expected loss is the Bayes optimal classifier:
$$h(X) = \arg\max_{Y} D(Y \mid X)$$
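As a concrete illustration, here is a minimal Python sketch that computes the Bayes optimal classifier and its expected 0-1 loss from a small, fully known joint distribution D(X,Y); the particular numbers are invented for the example.

```python
# Toy joint distribution D(X, Y) over a binary input X and classes {"a", "b"}.
# The probabilities are invented for the illustration.
D = {
    (0, "a"): 0.3,
    (0, "b"): 0.1,
    (1, "a"): 0.2,
    (1, "b"): 0.4,
}

def bayes_optimal(x):
    """argmax_Y D(Y | X = x); the denominator D(X = x) cancels in the argmax."""
    classes = {y for (_, y) in D}
    return max(classes, key=lambda y: D.get((x, y), 0.0))

# Expected 0-1 loss: err(h) = sum_{X,Y} D(X, Y) * [Y != h(X)]
err = sum(p for (x, y), p in D.items() if y != bayes_optimal(x))
print(err)  # 0.3 for this toy distribution: D(0,"b") + D(1,"a")
```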
Approaches to Solving
Classification Problems - I
1. Generative. Try to estimate the probability distribution of the data D(X,Y):
specify a parametric model family $\{P_\theta(X, Y) : \theta \in \Theta\}$
choose parameters $\hat{\theta}$ by maximum likelihood on the training data: $L(T \mid \theta) = \prod_{i=1}^{n} P_\theta(X_i, Y_i)$
estimate conditional probabilities by Bayes rule: $P_{\hat{\theta}}(Y \mid X) = P_{\hat{\theta}}(X, Y) / P_{\hat{\theta}}(X)$
classify new instances to the most probable class Y according to $P_{\hat{\theta}}(Y \mid X)$
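A minimal sketch of this recipe on categorical data, with an invented toy training set: a saturated joint model P(X,Y) fit by relative-frequency maximum likelihood, then classification by Bayes rule.

```python
from collections import Counter

def fit_joint(data):
    """Maximum-likelihood (relative-frequency) estimate of a categorical joint model:
    P(x, y) = count(x, y) / n."""
    counts = Counter(data)
    n = len(data)
    return {xy: c / n for xy, c in counts.items()}

def classify(p_joint, x, classes):
    """Bayes rule: argmax_y P(y | x) = argmax_y P(x, y) / P(x); P(x) cancels."""
    return max(classes, key=lambda y: p_joint.get((x, y), 0.0))

# Invented training pairs (X_i, Y_i).
train = [("sunny", "+"), ("sunny", "+"), ("rainy", "-"), ("rainy", "+"), ("rainy", "-")]
p = fit_joint(train)
print(classify(p, "rainy", {"+", "-"}))  # "-" since P(rainy,"-") = 2/5 > P(rainy,"+") = 1/5
```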
Approaches to Solving
Classification Problems - II
2. Discriminative. Try to estimate the conditional distribution D(Y|X) from data:
specify a parametric model family $\{P_\theta(Y \mid X) : \theta \in \Theta\}$
estimate parameters $\hat{\theta}$ by maximum conditional likelihood of the training data: $CL(T \mid \theta, X) = \prod_{i=1}^{n} P_\theta(Y_i \mid X_i)$
classify new instances to the most probable class Y according to $P_{\hat{\theta}}(Y \mid X)$
3. Discriminative, distribution-free. Try to estimate $\hat{h}(X)$ directly from the data so that its expected loss will be minimized.
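And a matching sketch of approach 2, with invented data: binary logistic regression fit by stochastic gradient ascent on the conditional log-likelihood $\sum_i \log P_\theta(Y_i \mid X_i)$.

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit P(Y=1|x) = sigmoid(w.x + b) by gradient ascent on the conditional log-likelihood."""
    d = len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))
            # Gradient of log P(y|x) with respect to (w, b) is (y - p) * (x, 1).
            for j in range(d):
                w[j] += lr * (y - p) * x[j]
            b += lr * (y - p)
    return w, b

# Invented data: the label follows the first feature.
xs = [(0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
ys = [0, 1, 1, 0]
w, b = fit_logistic(xs, ys)
print(1.0 / (1.0 + math.exp(-(w[0] * 1.0 + w[1] * 0.0 + b))))  # close to 1 for x = (1, 0)
```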
Axes for comparison of
different approaches





Asymptotic accuracy
Accuracy for limited training data
Speed of convergence to the best
hypothesis
Complexity of training
Modeling ease
Generative-Discriminative
Pairs
Definition: if a generative and a discriminative parametric model family can represent the same set of conditional probability distributions $P(Y \mid X)$, they are a generative-discriminative pair.
Example: Naïve Bayes and logistic regression with $Y \in \{1, 2, \ldots, K\}$ and $X_1, X_2 \in \{0, 1\}$:
$$P_{NB}(Y = i \mid X_1, X_2) = \frac{P(Y = i)\, P(X_1 \mid Y = i)\, P(X_2 \mid Y = i)}{\sum_{i' = 1 \ldots K} P(Y = i')\, P(X_1 \mid Y = i')\, P(X_2 \mid Y = i')}$$
$$P_{LR}(Y = i \mid X_1, X_2) = \frac{\exp(\lambda_{i1} X_1 + \lambda_{i2} X_2 + \lambda_{i0})}{\sum_{i' = 1 \ldots K} \exp(\lambda_{i'1} X_1 + \lambda_{i'2} X_2 + \lambda_{i'0})}$$
[Figure: Naïve Bayes graphical model: class Y generating features X1 and X2]
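The correspondence can be checked numerically. In the sketch below, an arbitrary invented Naïve Bayes model is converted into log-linear weights, using $\lambda_{ij} = \log P(X_j{=}1 \mid Y{=}i) - \log P(X_j{=}0 \mid Y{=}i)$ and $\lambda_{i0} = \log P(Y{=}i) + \sum_j \log P(X_j{=}0 \mid Y{=}i)$, and the two parameterizations are verified to define the same conditional distribution.

```python
import math
from itertools import product

# An arbitrary invented Naive Bayes model: K = 2 classes, two binary features.
p_y = {0: 0.6, 1: 0.4}
p_x1 = {0: [0.2, 0.7], 1: [0.9, 0.3]}  # p_x1[i][j] = P(X_j = 1 | Y = i)

def nb_conditional(x1, x2, i):
    def joint(c):
        p = p_y[c]
        for j, xj in enumerate((x1, x2)):
            p *= p_x1[c][j] if xj else 1 - p_x1[c][j]
        return p
    return joint(i) / sum(joint(c) for c in p_y)

# Equivalent log-linear (logistic-regression-style) weights [lambda_i0, lambda_i1, lambda_i2].
lam = {}
for i in p_y:
    lam[i] = [math.log(p_y[i]) + sum(math.log(1 - p) for p in p_x1[i])] + \
             [math.log(p) - math.log(1 - p) for p in p_x1[i]]

def lr_conditional(x1, x2, i):
    def score(c):
        return math.exp(lam[c][0] + lam[c][1] * x1 + lam[c][2] * x2)
    return score(i) / sum(score(c) for c in lam)

for x1, x2 in product([0, 1], repeat=2):
    assert abs(nb_conditional(x1, x2, 1) - lr_conditional(x1, x2, 1)) < 1e-12
print("NB and log-linear parameterizations agree on all inputs")
```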
Comparison of Naïve Bayes
and Logistic Regression

The NB assumption that features are independent given the class is not made by logistic regression:
$$P_{NB}(X_1, X_2 \mid Y = i) = P(X_1 \mid Y = i)\, P(X_2 \mid Y = i)$$
$$P_{LR}(X_1, X_2 \mid Y = i) = \frac{P(X_1, X_2)}{P(Y = i)} \cdot \frac{\exp(\lambda_{i1} X_1 + \lambda_{i2} X_2 + \lambda_{i0})}{\sum_{i' = 1 \ldots K} \exp(\lambda_{i'1} X_1 + \lambda_{i'2} X_2 + \lambda_{i'0})}$$
The logistic regression model is more general because it allows a larger class of probability distributions for the features given classes.
Example: Traffic Lights
Reality (NS light, EW light, lights working w / broken b):
Lights working: P(g, r, w) = 3/7, P(r, g, w) = 3/7
Lights broken: P(r, r, b) = 1/7
NB model: class variable "Working?" with the NS and EW lights as features.
Model assumptions false! JL and CL estimates differ…
JL: P(w) = 6/7, P(r|w) = 1/2, P(r|b) = 1
CL: P̃(w) = ε, P̃(r|w) = 1/2, P̃(r|b) = 1
Joint Traffic Lights
Under the JL parameters, the model assigns these joint probabilities:
Lights working: 3/14 for each of the four light configurations (g,g), (g,r), (r,g), (r,r)
Lights broken: 2/14 for (r,r), 0 for the other configurations
For the working examples the conditional likelihood of "working" is 1, but for the broken (r,r) example the conditional likelihood of "working" is > 1/2, so it is incorrectly assigned. Accuracy: 6/7.
Conditional Traffic Lights
Under the CL parameters (with P̃(w) = ε), the model assigns:
Lights working: ε/4 for each of the four light configurations
Lights broken: 1 − ε for (r,r), 0 for the other configurations
The (r,r) example is now correctly assigned to broken, and the conditional likelihood of "working" for the working examples is still 1. Accuracy: 7/7. As ε → 0, the conditional likelihood becomes perfect (→ 1) while the joint likelihood becomes low (→ 0).
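The example can be reproduced numerically. The following sketch approximates the CL parameters by simply plugging in a tiny prior on "working" rather than running an actual conditional-likelihood optimizer; it scores the seven observations under both parameter settings and reports accuracy, joint likelihood, and conditional likelihood.

```python
import math

# The seven observations: (NS light, EW light, class), with w = working, b = broken.
data = [("g", "r", "w")] * 3 + [("r", "g", "w")] * 3 + [("r", "r", "b")]

def nb_joint(ns, ew, c, prior_w, p_red_given_w):
    """Naive Bayes joint P(c, ns, ew): given 'working' each light is independently red
    with probability p_red_given_w; given 'broken' both lights are red."""
    if c == "w":
        p = prior_w
        for light in (ns, ew):
            p *= p_red_given_w if light == "r" else 1 - p_red_given_w
        return p
    return (1 - prior_w) * (1.0 if ns == "r" else 0.0) * (1.0 if ew == "r" else 0.0)

def evaluate(prior_w, p_red_given_w):
    correct, log_jl, log_cl = 0, 0.0, 0.0
    for ns, ew, c in data:
        scores = {cls: nb_joint(ns, ew, cls, prior_w, p_red_given_w) for cls in ("w", "b")}
        correct += max(scores, key=scores.get) == c
        log_jl += math.log(scores[c])                          # joint likelihood term
        log_cl += math.log(scores[c] / sum(scores.values()))   # conditional likelihood term
    return correct, math.exp(log_jl), math.exp(log_cl)

print(evaluate(prior_w=6 / 7, p_red_given_w=0.5))   # JL params: 6 of 7 correct, cond. likelihood 0.4
print(evaluate(prior_w=1e-6, p_red_given_w=0.5))    # CL-style params: 7 of 7 correct, cond. likelihood near 1
```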
Comparison of Naïve Bayes
and Logistic Regression
Naïve Bayes vs. Logistic Regression:
Accuracy: advantage logistic regression
Convergence: advantage Naïve Bayes
Training speed: advantage Naïve Bayes
Model assumptions: NB assumes independence of the features given the class; LR assumes linear log-odds, i.e. $\log\left(P(X_1, X_2 \mid Y = i) \,/\, P(X_1, X_2 \mid Y = j)\right)$ is linear
Advantages of NB: faster convergence, uses the information in P(X), faster training
Advantages of LR: more robust and accurate because it makes fewer assumptions
Disadvantages of NB: large bias if the independence assumptions are violated
Disadvantages of LR: harder parameter estimation problem, ignores the information in P(X)
Some Experimental
Comparisons
[Figure: two plots of accuracy vs. training data size comparing LR and NB. Left: Ng & Jordan 2002 (15 datasets from UCI ML). Right: Klein & Manning 2002 (WSD "line" and "hard" data).]
Part-of-Speech Tagging
POS tagging is determining the part of speech of every word in a
sentence.
[Figure: the sentence "Joy makes progress every day ." with candidate tags for each word: Joy NN/NNP, makes VBZ/NNS, progress NN/VB, every DT, day NN]
Sequence classification problem with 45 classes (Penn Treebank).
Accuracies are high (around 97%)! Some argue they can't go much higher.
Existing approaches:
rule-based (hand-crafted, TBL)
generative (HMM)
discriminative (maxent, memory-based, decision tree, neural network, linear models (boosting, perceptron))
Part-of-Speech Tagging
Useful Features
The complete solution of the problem requires full syntactic and semantic understanding of sentences.
In most cases, information about surrounding words/tags is a strong disambiguator:
"The long fenestration was tiring."
Useful features:
tags of previous/following words, e.g. P(NN|JJ)=.45; P(VBP|JJ)=0.0005
identity of the word being tagged/surrounding words
suffix/prefix for unknown words, hyphenation, capitalization
longer distance features
others we haven't figured out yet
HMM Tagging Models - I
Independence Assumptions
[Figure: HMM as a directed graphical model: a chain of tags t1 -> t2 -> t3, each tag ti generating the word wi]
ti is independent of t1…ti-2 and w1…wi-1 given ti-1
words are independent given their tags
states can be single tags, pairs of successive tags, or variable-length sequences of last tags
[Figure: unknown-word model (Weischedel et al. 93): the tag t generates unknown-word features such as capitalization, suffix, and hyphenation]
HMM Tagging Models - Brants 2000
Highly competitive with other state-of-the-art models
Trigram HMM with smoothed transition probabilities
The capitalization feature becomes part of the state: each tag state is split into two, e.g. NN -> <NN,cap>, <NN,not cap>
Suffix features for unknown words:
$$P(w \mid tag) \approx P(\mathrm{suffix}(w) \mid tag) = \hat{P}(suffix)\, \tilde{P}(tag \mid suffix) \,/\, \hat{P}(tag)$$
$$\tilde{P}(tag \mid suffix_n) = \lambda_1 \hat{P}(tag \mid suffix_n) + \lambda_2 \hat{P}(tag \mid suffix_{n-1}) + \ldots + \lambda_n \hat{P}(tag)$$
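As an illustration of the suffix idea (a sketch, not Brants' actual algorithm: the interpolation weights here are crude equal weights rather than weights estimated from data, and the result is not renormalized), the code below estimates $\tilde{P}(tag \mid suffix)$ by interpolating relative-frequency estimates for successively shorter suffixes with the raw tag prior.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_suffix=4, lambdas=None):
    """Interpolate P^(tag | suffix_k) for suffix lengths k = max_suffix..1 with P^(tag)."""
    tag_counts = Counter(tag for _, tag in tagged_words)
    suffix_tag = defaultdict(Counter)
    for word, tag in tagged_words:
        for k in range(1, max_suffix + 1):
            suffix_tag[word[-k:]][tag] += 1
    n = sum(tag_counts.values())
    lambdas = lambdas or [1.0 / (max_suffix + 1)] * (max_suffix + 1)  # crude equal weights

    def p_tag_given_suffix(tag, word):
        p = lambdas[-1] * tag_counts[tag] / n          # back off all the way to P^(tag)
        for j, k in enumerate(range(max_suffix, 0, -1)):
            c = suffix_tag.get(word[-k:], Counter())
            if sum(c.values()):
                p += lambdas[j] * c[tag] / sum(c.values())
        return p

    return p_tag_given_suffix

# Tiny invented training set.
train = [("running", "VBG"), ("eating", "VBG"), ("walking", "VBG"),
         ("nation", "NN"), ("station", "NN")]
p = train_suffix_model(train)
print(p("VBG", "jogging") > p("NN", "jogging"))  # True: the -ing suffix favours VBG
```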
CMM Tagging Models
Independence Assumptions
[Figure: conditional Markov model (CMM): each tag ti is conditioned on the previous tag ti-1 and the observation wi]
ti is independent of t1…ti-2 and w1…wi-1 given ti-1
ti is independent of all following observations
no independence assumptions on the observation sequence
Dependence of the current tag on previous and future observations can be added; overlapping features of the observation can be taken as predictors.
MEMM Tagging Models - II
Ratnaparkhi (1996)
local distributions are estimated using maximum entropy models
used previous two tags, current word, previous two words, next two words
suffix, prefix, hyphenation, and capitalization features for unknown words
Model                 Overall Accuracy    Unknown Words
HMM (Brants 2000)     96.7                85.5
MEMM (Ratn 1996)      96.63               85.56
MEMM (T&M 2000)       96.86               86.91
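To show how the local maximum entropy distributions are used at decoding time, here is a schematic sketch in which the feature weights are invented and only a greedy left-to-right decoder is shown; a real MEMM tagger uses trained weights and Viterbi or beam search.

```python
import math
from collections import defaultdict

# Invented weights for a tiny local log-linear model P(t_i | t_{i-1}, w_i).
WEIGHTS = defaultdict(float, {
    ("word=the", "DT"): 3.0, ("word=dog", "NN"): 2.0, ("word=barks", "VBZ"): 2.0,
    ("suffix=s", "VBZ"): 0.5, ("prev=DT", "NN"): 1.0, ("prev=NN", "VBZ"): 1.0,
})
TAGS = ["DT", "NN", "VBZ"]

def local_distribution(prev_tag, word):
    """Maximum-entropy-style local distribution over the current tag."""
    feats = ["word=" + word, "suffix=" + word[-1], "prev=" + prev_tag]
    scores = {t: math.exp(sum(WEIGHTS[(f, t)] for f in feats)) for t in TAGS}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

def greedy_tag(words):
    """P(t_1..t_n | w_1..w_n) = prod_i P(t_i | t_{i-1}, w_i); decoded greedily here."""
    tags, prev = [], "<s>"
    for w in words:
        dist = local_distribution(prev, w)
        prev = max(dist, key=dist.get)
        tags.append(prev)
    return tags

print(greedy_tag(["the", "dog", "barks"]))  # ['DT', 'NN', 'VBZ'] with these toy weights
```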
HMM vs CMM – I
Johnson (2001) compared tagging models with different dependency structures between tags tj, tj+1 and words wj, wj+1:
[Figure: three graphical structures over tj, tj+1, wj, wj+1]
Model accuracies for the three structures: 95.5%, 94.4%, and 95.3%.
HMM vs CMM - II

The per-state conditioning of the CMM has been observed to exhibit label bias (Bottou, Lafferty) and observation bias (Klein & Manning).
Klein & Manning (2002): HMM 91.23, CMM 89.22, CMM+ 90.44
[Figure: CMM dependency structure over tags t1, t2, t3 and words w1, w2, w3]
Unobserving words with unambiguous tags improved performance significantly (the CMM+ model).
Conditional Random Fields
(Lafferty et al 2001)





Models that are globally conditioned on the
observation sequence; define distribution P(Y|X) of
tag sequence given word sequence
No independence assumptions about the
observations; no need to model their distribution
The labels can depend on past and future
observations
Avoids the independence assumption of CMMs that
labels are independent of future observations and
thus the label and observation bias problems
The parameter estimation problem is much harder
CRF - II
[Figure: linear-chain CRF over tags t1, t2, t3 and words w1, w2, w3]
The HMM and this chain CRF form a generative-discriminative pair.
Independence assumptions: a tag is independent of all other tags in the sequence given its neighbors and the word sequence.
$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \frac{\exp\left(\sum_{j=1}^{n} (\lambda_{t_{j-1} t_j} + \mu_{t_j w_j})\right)}{\sum_{t'_1 \ldots t'_n \in T^n} \exp\left(\sum_{j=1}^{n} (\lambda_{t'_{j-1} t'_j} + \mu_{t'_j w_j})\right)}$$
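The conditional probability above can be computed without enumerating all $|T|^n$ tag sequences. The sketch below uses invented feature weights: it scores one tag sequence, computes the partition function with the forward algorithm, and checks the result against brute-force enumeration.

```python
import math
from itertools import product

TAGS = ["D", "N", "V"]

def log_potential(prev_tag, tag, word):
    """Per-position score lambda_{t_{j-1} t_j} + mu_{t_j w_j}; the weights are invented."""
    lam = {("D", "N"): 1.0, ("N", "V"): 1.0}.get((prev_tag, tag), 0.0)
    mu = {("D", "the"): 2.0, ("N", "dog"): 2.0, ("V", "barks"): 2.0}.get((tag, word), 0.0)
    return lam + mu

def sequence_score(tags, words):
    return sum(log_potential(p, t, w) for p, t, w in zip(["<s>"] + list(tags), tags, words))

def log_partition(words):
    """Forward algorithm: log-sum over all tag sequences in O(n * |T|^2)."""
    alpha = {t: log_potential("<s>", t, words[0]) for t in TAGS}
    for w in words[1:]:
        alpha = {t: math.log(sum(math.exp(alpha[p] + log_potential(p, t, w)) for p in TAGS))
                 for t in TAGS}
    return math.log(sum(math.exp(a) for a in alpha.values()))

words = ["the", "dog", "barks"]
print(round(math.exp(sequence_score(["D", "N", "V"], words) - log_partition(words)), 4))

# Sanity check: brute-force enumeration over all |T|^n sequences gives the same value.
z = sum(math.exp(sequence_score(seq, words)) for seq in product(TAGS, repeat=len(words)))
print(round(math.exp(sequence_score(["D", "N", "V"], words)) / z, 4))
```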
CRF-Experimental Results
Model           Accuracy    Unknown Word Accuracy
HMM             94.31%      54.01%
CMM (MEMM)      93.63%      45.39%
CRF             94.45%      51.95%
CMM+ (MEMM+)    95.19%      73.01%
CRF+            95.73%      76.24%
Discriminative Tagging
Model – Voted Perceptron
Collins 2002; best reported tagging results on WSJ.
Uses all the features used by Ratnaparkhi (96); each feature is an indicator $\phi_s(h_i, t_i) \in \{0, 1\}$.
Learns a linear function
$$F(w_{1:n}, t_{1:n}, \bar{\alpha}) = \sum_{s = 1 \ldots d} \alpha_s \Phi_s(w_{1:n}, t_{1:n}) = \sum_{s = 1 \ldots d} \alpha_s \sum_{i = 1 \ldots n} \phi_s(h_i, t_i)$$
and classifies according to
$$t_{1:n} = \arg\max_{t'_{1:n} \in T^n} F(w_{1:n}, t'_{1:n}, \bar{\alpha})$$
Accuracy: MEMM (Ratnaparkhi 96) 96.72%, Voted Perceptron 97.11%.
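The training rule behind this model is simple to sketch. The code below is an illustration, not Collins' system: it uses a toy feature set, exhaustive search instead of Viterbi, and the plain rather than the voted/averaged perceptron.

```python
from collections import defaultdict
from itertools import product

TAGS = ["D", "N", "V"]

def features(words, tags):
    """Global feature vector Phi(w, t): counts of tag-word and tag-tag indicators."""
    f, prev = defaultdict(int), "<s>"
    for w, t in zip(words, tags):
        f[("word", w, t)] += 1
        f[("trans", prev, t)] += 1
        prev = t
    return f

def decode(words, weights):
    """argmax_t F(w, t, alpha); exhaustive here for brevity (Viterbi in practice)."""
    def score(tags):
        return sum(weights[k] * v for k, v in features(words, tags).items())
    return max(product(TAGS, repeat=len(words)), key=score)

def train_perceptron(data, epochs=5):
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            guess = decode(words, weights)
            if list(guess) != list(gold):
                # Perceptron update: add the gold features, subtract the guessed ones.
                for k, v in features(words, gold).items():
                    weights[k] += v
                for k, v in features(words, guess).items():
                    weights[k] -= v
    return weights

data = [(["the", "dog", "barks"], ["D", "N", "V"]), (["the", "cat"], ["D", "N"])]
w = train_perceptron(data)
print(decode(["the", "dog"], w))  # ('D', 'N') on this toy data
```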
Summary of Tagging Review
For tagging, the change from a generative to a discriminative model does not by itself result in great improvement (e.g. HMM and CRF).
One profits from discriminative models when specifying dependence on overlapping features of the observation, such as spelling, suffix analysis, etc.
The CMM model allows integration of rich features of the observations, but suffers strongly from assuming independence from following observations; this effect can be relieved by adding dependence on following words.
This additional power (of the CMM, CRF, and Perceptron models) has been shown to result in improvements in accuracy, though not dramatic ones (up to 11% error reduction).
The higher accuracy of discriminative models comes at the price of much slower training.
More research is needed on specifying useful features (or tagging the WSJ Penn Treebank is a noisy task and the limit has been reached).
Parsing Models

Syntactic parsing is the task of assigning a parse tree to a sentence, corresponding to its most likely interpretation.
[Figure: phrase-structure parse tree for "I saw Mary with the telescope" (S -> NP VP; VP -> VBD NP PP)]
Existing approaches:
hand-crafted rule-based heuristic methods
probabilistic generative models
conditional probabilistic discriminative models
discriminative ranking models
Generative Parsing Models


Generative models based on PCFG grammars learned from corpora are still among the best performing (Collins 97, Charniak 97, 00): 88%-89% labeled precision/recall.
The generative models learn a distribution P(X,Y) over <sentence, parse tree> pairs:
$$P(X, Y) = \prod_{n \in nodes(Y)} P(\mathrm{expansion}(n) \mid \mathrm{history}(n))$$
and select a single most likely parse for a sentence X based on:
$$Y_{best} = \arg\max_{Y : \mathrm{yield}(Y) = X} P(X, Y)$$
Easy to train using relative frequency estimation (RFE) for maximum likelihood.
These models have the advantage of being usable as language models (Chelba & Jelinek 00, Charniak 00).
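As a toy illustration of the formula for P(X,Y), with an invented two-tree "treebank" and history(n) reduced to just the parent label (a plain PCFG), rule probabilities are estimated by relative frequency and the probability of a tree is the product of its expansion probabilities.

```python
from collections import Counter, defaultdict

# Invented mini-treebank: trees are (label, children) tuples, leaves are words.
trees = [
    ("S", [("NP", ["I"]), ("VP", [("V", ["saw"]), ("NP", ["Mary"])])]),
    ("S", [("NP", ["Mary"]), ("VP", [("V", ["saw"]), ("NP", ["I"])])]),
]

def rules(tree):
    """Yield (lhs, rhs) CFG rules for every internal node of a tree."""
    label, children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

# Relative-frequency (maximum-likelihood) estimates of P(lhs -> rhs | lhs).
counts = Counter(r for t in trees for r in rules(t))
lhs_totals = defaultdict(int)
for (lhs, _), c in counts.items():
    lhs_totals[lhs] += c
prob = {r: c / lhs_totals[r[0]] for r, c in counts.items()}

def tree_prob(tree):
    """P(X, Y) = product over tree nodes of P(expansion(n) | parent label)."""
    p = 1.0
    for r in rules(tree):
        p *= prob.get(r, 0.0)
    return p

print(tree_prob(trees[0]))  # 0.25 = 1 * 0.5 * 1 * 1 * 0.5 for this mini-treebank
```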
Generative History-Based
Model – Collins 97
Accuracy for sentences of <= 100 words: 88.1% LP, 87.5% LR
[Figure: lexicalized parse tree for "Last week Marks bought Brooks": TOP -> S(bought); S(bought) -> NP(week) NP-C(Marks) VP(bought); NP(week) -> JJ(Last) NN(week); VP(bought) -> VBD(bought) NP-C(Brooks)]
Discriminative models
Shift-reduce parser, Ratnaparkhi (98)
Learns a distribution P(T|S) of parse trees given sentences using the sequence of actions of a shift-reduce parser:
$$P(T \mid S) = \prod_{i=1}^{n} P(a_i \mid a_1 \ldots a_{i-1}, S)$$
Uses a maximum entropy model to learn the conditional distribution of a parse action given the history.
Suffers from independence assumptions that actions are independent of future observations, as the CMM does.
Higher parameter estimation cost to learn the local maximum entropy models.
Lower but still good accuracy: 86%-87% labeled precision/recall.
Discriminative Models –
Distribution Free Re-ranking


Represent sentence-parse tree pairs by a feature vector F(X,Y).
Learn a linear ranking model with parameters $\bar{\alpha}$ using the boosting loss.

Model                     LP       LR
Collins 99 (Generative)   88.3%    88.1%
Collins 00 (BoostLoss)    89.9%    89.6%

This is a 13% error reduction, yet still very close in accuracy to the generative model (Charniak 00).
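A schematic sketch of the re-ranking setup, not Collins' boosting-loss learner: a perceptron-style update stands in for boosting, and the candidate parses with their feature vectors F(X,Y) are invented.

```python
from collections import defaultdict

def score(weights, feats):
    """Linear ranking score F(X, Y) . alpha."""
    return sum(weights[k] * v for k, v in feats.items())

def rerank(weights, candidates):
    """Return the index of the highest-scoring candidate; candidates = [(features, is_gold)]."""
    return max(range(len(candidates)), key=lambda i: score(weights, candidates[i][0]))

def train(sentences, epochs=10):
    """Perceptron-style stand-in for the boosting-loss learner."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for candidates in sentences:
            best = rerank(weights, candidates)
            gold = next(i for i, (_, g) in enumerate(candidates) if g)
            if best != gold:
                for k, v in candidates[gold][0].items():
                    weights[k] += v
                for k, v in candidates[best][0].items():
                    weights[k] -= v
    return weights

# Invented candidate lists: each candidate is (feature counts, is_gold_parse).
sentences = [
    [({"rule:NP->NP PP": 1, "base_logprob": -1.5}, False),
     ({"rule:VP->V NP PP": 1, "base_logprob": -2.0}, True)],
    [({"rule:NP->NP PP": 1, "base_logprob": -1.8}, False),
     ({"rule:VP->V NP PP": 1, "base_logprob": -2.2}, True)],
]
w = train(sentences)
print(rerank(w, sentences[0]))  # 1: the gold candidate wins after training
```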
Comparison of Generative-Discriminative Pairs
Johnson (2001) compared simple PCFGs trained to maximize L(T,S) and L(T|S).
A simple PCFG has parameters
$$\Theta = \{\theta_{ij} = P(A_i \to \alpha_j \mid A_i),\ \sum_{j = 1 \ldots m_i} \theta_{ij} = 1,\ \forall i\}$$
Models:
$$\Theta_{MLE} = \arg\max_{\Theta} \sum_{i = 1 \ldots n} \log P_{\Theta}(T_i, S_i)$$
$$\Theta_{MCLE} = \arg\max_{\Theta} \sum_{i = 1 \ldots n} \log P_{\Theta}(T_i \mid S_i)$$
Results:

Model   LPrecision   LRecall
MLE     0.815        0.789
MCLE    0.817        0.794
Unification-Based Grammars - I
Unification-based grammars (UBG) are often defined using a context-free base and a set of path equations:
S[number X] -> NP[number X] VP[number X]
NP[number X] -> N[number X]
VP[number X] -> V[number X]
N[number sg] -> dog ; N[number pl] -> dogs
V[number sg] -> barks ; V[number pl] -> bark
A PCFG grammar can be defined using the context-free backbone CFG_UBG (e.g. S -> NP VP).
The UBG generates "dogs bark" and "dog barks". The CFG_UBG generates "dogs bark", "dog barks", "dog bark", and "dogs barks".
Unification-Based Grammars
- II
A simple PCFG for CFG_UBG has parameters from the set
$$\Theta = \{\theta_{ij} = P(A_i \to \alpha_j \mid A_i),\ \sum_{j = 1 \ldots m_i} \theta_{ij} = 1,\ \forall i\}$$
It defines a joint distribution P(T,S) and a conditional distribution of trees given sentences:
$$P_{PCFG}(T \mid S) = \frac{\prod_{n \in nodes(T)} \theta_{A_n \to Exp_n}}{\sum_{T' \in CFG_{UBG},\ Yield(T') = S}\ \prod_{n' \in nodes(T')} \theta_{A_{n'} \to Exp_{n'}}}$$
A conditional weighted CFG defines only a conditional probability; the conditional probability of any tree T outside the UBG is 0:
$$P_{CWCFG_{UBG}}(T \mid S) = \frac{\prod_{n \in nodes(T)} \theta_{A_n \to Exp_n}}{\sum_{T' \in UBG,\ Yield(T') = S}\ \prod_{n' \in nodes(T')} \theta_{A_{n'} \to Exp_{n'}}}$$
Weighted CFGs for Unification-based
grammars - III
[Figure: bar chart of parse selection accuracy, generative vs. conditional log-linear training: HMM Tagger 47.7 vs 48.7, PCFG-S 66.3 vs 79.3, PCFG-A 76.7 vs 81.8]
The conditional weighted CFGs perform consistently better than their generative counterparts.
Negative information is extremely helpful here; knowing that the conditional probability of trees outside the UBG is zero, plus conditional training, amounts to a 38% error reduction for the simple PCFG model.
Summary of Parsing Results




The single small study comparing a parsing generative-discriminative pair for PCFG parsing showed a small (insignificant) advantage for the discriminative model; the added computational cost is probably not worth it.
The best performing statistical parsers are still generative (Charniak 00, Collins 99) or use a generative model as a preprocessing stage (Collins 00, Collins 2002), part of which has to do with computational complexity.
Discriminative models allow more complex representations, such as the all-subtrees representation (Collins 2002) or other overlapping features (Collins 00), and this has led to up to 13% improvement over a generative model.
Discriminative training seems promising for parse selection tasks for UBG, where the number of possible analyses is not enormous.
Conclusions




For the current sizes of training data available for NLP tasks such as tagging and parsing, discriminative training has not by itself yielded large gains in accuracy.
The flexibility of including non-independent features of the observations in discriminative models has resulted in improved part-of-speech tagging models (though for some tasks it might not justify the added computational complexity).
For parsing, discriminative training has shown improvements when used for re-ranking or when using negative information (UBG).
If you come up with a feature that is very hard to incorporate in a generative model and seems extremely useful, see if a discriminative approach will be computationally feasible!