CS546: Machine Learning and Natural Language
Latent-Variable Models for Structured Prediction Problems:
Syntactic Parsing
Slides / Figures from Slav Petrov's talk at COLING-ACL 06 are used in this lecture
Parsing Problem
• Annotation refines base treebank symbols to improve statistical fit of the grammar
• Parent annotation [Johnson 98] (see the example below)
• Head lexicalization [Collins 99, ...]
• Automatic Annotation [Matsuzaki et al, 05; ...]
• Manual Annotation [Klein and Manning 03]
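For example (our own illustration, not from the slides), parent annotation appends the parent's category to each nonterminal, so subject and object NPs become different symbols and can have different rewrite distributions:

(S (NP^S (PRP She)) (VP^S (VBD saw) (NP^VP (DT the) (NN play))))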
Manual Annotation
• Manually split categories
– NP: subject vs object
– DT: determiners vs demonstratives
– IN: sentential vs prepositional
• Advantages:
– Fairly compact grammar
– Linguistic motivations
• Disadvantages:
– Performance leveled out
– Manually annotated
Model                 F1
Naïve Treebank PCFG   72.6
Klein & Manning '03   86.3
Automatic Annotation
• Use Latent Variable Models
– Split ("annotate") each node: e.g., NP -> (NP[1], NP[2], ..., NP[T])
– Each node in the tree is annotated with a latent sub-category
– Latent-Annotated Probabilistic CFG (LA-PCFG): to obtain the probability of a tree you sum over all the latent variables (see the formula below)
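As a worked formula (our notation, not from the slides): the probability of an unannotated tree T for a sentence s is obtained by summing over all assignments a of latent subcategories to the nodes of T,

P(T, s) = \sum_{a} \prod_{(A \to B\,C) \in T} P\big(A[a_A] \to B[a_B]\, C[a_C]\big) \; \prod_{(A \to w) \in T} P\big(A[a_A] \to w\big)

where a_X denotes the subcategory assigned to a node labeled X. The same treebank rule can thus receive different probabilities depending on the latent annotations of its nodes.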
How to perform this clustering?
• Estimating model parameters (and model structure)
– Decide how to split each symbol (what is T in NP -> (NP[1], NP[2], ..., NP[T]))
– Estimate probabilities P(A[x] -> B[y] C[z]) for all annotated rules
• Parsing:
– Do you need the most likely 'annotated' parse tree (1) or the most likely tree with non-annotated nodes (2)?
– Usually (2), but the inferred latent variables can be useful for other tasks
Estimating the model
• Estimating parameters:
– If we decide on the structure of the model (how we split), we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06; ...); see the sketch below:
• E-Step: estimate the posterior over the latent annotations of each treebank tree and obtain fractional counts of annotated rules
• M-Step: re-estimate rule probabilities by normalizing the fractional counts
– Can also use variational methods (mean-field): [Titov and Henderson, 07; Liang et al, 07]
• Recall: we considered variational methods in the context of LDA
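A minimal sketch of one E-step, assuming the tree structures in the training data are observed and only the node subcategories are latent (the code and all names in it are our illustration, not code from the cited papers). For a single binary treebank tree it runs inside-outside over the latent subcategories and accumulates fractional rule counts.

# E-step sketch for a latent-annotated PCFG (illustrative, not from the lecture).
# Tree structures are observed; only the subcategory of each node is latent.
import numpy as np

T = 2  # number of latent subcategories per treebank symbol (chosen by hand here)

class Node:
    def __init__(self, sym, children=(), word=None):
        self.sym, self.children, self.word = sym, children, word

def inside(node, theta_bin, theta_lex):
    # alpha[x] = P(subtree below node | node carries subcategory x)
    if node.word is not None:                       # preterminal -> word
        node.alpha = theta_lex[(node.sym, node.word)].copy()      # shape (T,)
    else:                                           # binary rule A -> B C
        left, right = node.children
        inside(left, theta_bin, theta_lex)
        inside(right, theta_bin, theta_lex)
        rule = theta_bin[(node.sym, left.sym, right.sym)]         # shape (T, T, T)
        node.alpha = np.einsum('xyz,y,z->x', rule, left.alpha, right.alpha)
    return node.alpha

def outside_and_counts(node, beta, theta_bin, theta_lex, counts, Z):
    # beta[x] = outside score of the node; accumulate fractional rule counts
    if node.word is not None:
        key = (node.sym, node.word)
        counts[key] = counts.get(key, 0.0) + beta * theta_lex[key] / Z
        return
    left, right = node.children
    rule = theta_bin[(node.sym, left.sym, right.sym)]
    # posterior over (x, y, z) for the rule A[x] -> B[y] C[z] used at this node
    joint = np.einsum('x,xyz,y,z->xyz', beta, rule, left.alpha, right.alpha) / Z
    key = (node.sym, left.sym, right.sym)
    counts[key] = counts.get(key, 0.0) + joint
    beta_left = np.einsum('x,xyz,z->y', beta, rule, right.alpha)
    beta_right = np.einsum('x,xyz,y->z', beta, rule, left.alpha)
    outside_and_counts(left, beta_left, theta_bin, theta_lex, counts, Z)
    outside_and_counts(right, beta_right, theta_bin, theta_lex, counts, Z)

def e_step(root, root_prior, theta_bin, theta_lex):
    counts = {}
    inside(root, theta_bin, theta_lex)
    Z = float(root_prior @ root.alpha)   # marginal probability of the observed tree
    outside_and_counts(root, root_prior, theta_bin, theta_lex, counts, Z)
    return counts, Z

An M-step would then renormalize these fractional counts separately for each annotated left-hand side A[x] to obtain the new rule probabilities.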
Estimating the model
• How to decide on how many subcategories to split each node into?
– Early models split all symbols equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher 05, ...], with T selected by hand
– The resulting models are sparse (parameter estimates are not reliable) and parsing time is large
Estimating the model
• How to decide on how many subcategories to split each node into?
– Later, different approaches were considered:
• (Petrov and Klein 06): Split-and-merge approach: recursively split each node in 2; if the likelihood is (significantly) improved, keep the split, otherwise merge back; continue until no improvement
• (Liang et al 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar (see the sketch below)
– The larger the training set, the more fine-grained the annotation
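Schematically (our notation; the actual model of Liang et al has more components), the Dirichlet-process approach places a stick-breaking prior over the subcategories of each symbol A, so the number of subcategories with appreciable probability mass is inferred from the data rather than fixed in advance:

\beta_A \sim \mathrm{GEM}(\alpha), \qquad z \sim \mathrm{Categorical}(\beta_A) \quad \text{(latent subcategory of a node labeled } A\text{)}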
Estimating the model
• How to decide on how many subcategories to split each node into?
• (Titov and Henderson 07; current work):
– Instead of annotating with a single label, annotate with a binary vector
– Use log-linear models for the rule probabilities instead of counts of productions (see the sketch below)
– The number of latent subcategories can be large: standard Gaussian regularization to avoid overtraining
– Efficient approximate parsing algorithms
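One generic way to write such a parameterization (our illustration; the actual model of Titov and Henderson differs in its details): score each annotated production with a log-linear model over features of the rule and the latent vectors, and place a Gaussian prior on the weights,

P\big(A[\mathbf{a}] \to B[\mathbf{b}]\, C[\mathbf{c}]\big) = \frac{\exp\{\mathbf{w}^\top \phi(A,\mathbf{a},B,\mathbf{b},C,\mathbf{c})\}}{\sum_{B',\mathbf{b}',C',\mathbf{c}'} \exp\{\mathbf{w}^\top \phi(A,\mathbf{a},B',\mathbf{b}',C',\mathbf{c}')\}}, \qquad \text{maximize} \;\; \sum_{\text{trees}} \log P(T) \;-\; \frac{\|\mathbf{w}\|^2}{2\sigma^2}

The Gaussian penalty is the "standard Gaussian regularization" mentioned above: it keeps the many weights associated with the binary latent vectors from overfitting.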
How to parse?
• Do you need the most likely 'annotated' parse tree (1) or the most likely tree with non-annotated nodes (2)?
• How to parse:
– (1) is easy: just usual parsing with the extended grammar (if all symbols are split into T subcategories)
– (2) is not tractable (NP-complete, [Matsuzaki et al, 2005])
– Instead you can do Minimum Bayes Risk decoding, i.e., output the minimum-loss tree [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07]: instead of predicting the most probable tree you output the tree with the minimal expected error (see the formula below)
(Not always a great idea, because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically implausible structures)
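Written out (our notation), Minimum Bayes Risk decoding returns the tree with the smallest expected loss under the model's posterior over trees:

\hat{T} = \arg\min_{T} \sum_{T'} L(T, T')\, P(T' \mid s)

For losses that decompose over labeled brackets (e.g., expected labeled recall [Goodman 96]), this expectation can be computed from marginal bracket probabilities, which is what makes the approach practical for LA-PCFGs.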
Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced constituent labels:
[Bar chart: number of induced subcategories per phrasal category (0-40). Frequent categories such as NP, VP, PP, S, and ADVP receive the most subcategories; rare categories (LST, ROOT, X, WHADJP, RRC, ...) are barely split.]
Adaptive splitting
• (Petrov and Klein, 06): Split and Merge: number of induced POS tags:
[Bar chart: number of induced subcategories per POS tag (0-70). Open-class tags such as NNP, JJ, NNS, NN, and VBN receive the most subcategories; tags such as POS, TO, punctuation, and symbols are barely split.]
Induced POS-tags
• Proper nouns (NNP):
NNP-14: Oct.   Nov.      Sept.
NNP-12: John   Robert    James
NNP-2:  J.     E.        L.
NNP-1:  Bush   Noriega   Peters
NNP-15: New    San       Wall
NNP-3:  York   Francisco Street
• Personal pronouns (PRP):
PRP-0:  It     He        I
PRP-1:  it     he        they
PRP-2:  it     them      him
Induced POS tags
• Relative adverbs (RBR):
RBR-0:  further  lower    higher
RBR-1:  more     less     More
RBR-2:  earlier  Earlier  later
• Cardinal Numbers (CD):
CD-7:   one      two      Three
CD-4:   1989     1990     1988
CD-11:  million  billion  trillion
CD-0:   1        50       100
CD-3:   1        30       31
CD-9:   78       58       34
Results for this model

Parser                  F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03     86.3              85.7
Matsuzaki et al. '05    86.7              86.1
Collins '99             88.6              88.2
Charniak & Johnson '05  90.1              89.6
Petrov & Klein, 06      90.2              89.7
LVs in Parsing
• In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs)
• In latent variable models you relax this assumption: you only assume how the structure annotated with latent variables decomposes
• In other words, you learn to construct composite features from the elementary features (parts) -> reduces the feature engineering effort
• Latent variable models have become popular in many applications:
– syntactic dependency parsing [Titov and Henderson, 07]: best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007)
– joint semantic role labeling and parsing [Henderson et al, 09]: again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009)
– hidden (dynamics) CRFs [Quattoni, 09]
– ...
Hidden CRFs
• CRF (Lafferty et al, 2001): no long-distance statistical dependencies between the labels y
• Latent Dynamic CRF: long-distance dependencies can be encoded using latent vectors (see the sketch below)
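Schematically (our notation; the exact model from the slide's figure is not reproduced here), a latent-variable CRF marginalizes over a hidden state sequence \mathbf{h} when scoring the label sequence \mathbf{y} for an input \mathbf{x}:

P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{h}} P(\mathbf{y}, \mathbf{h} \mid \mathbf{x}), \qquad P(\mathbf{y}, \mathbf{h} \mid \mathbf{x}) = \frac{\exp\{\mathbf{w}^\top \Phi(\mathbf{x}, \mathbf{h}, \mathbf{y})\}}{Z(\mathbf{x})}

Even if \Phi contains only local features over adjacent hidden states, summing out \mathbf{h} induces dependencies between labels that are far apart in the sequence.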
Latent Variables
• Drawbacks:
– Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling, ...)
– The optimization problem is often non-convex: many local minima
– Inference (decoding) can be more expensive
• Advantages:
– Reduces the feature engineering effort
– Especially preferable if little domain knowledge is available and complex features are needed
– The induced representation can be used for other tasks (e.g., the fine-grained grammar induced by an LA-PCFG can be useful for SRL)
– Latent variables (= hidden representations) can be useful in multi-task learning: a hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]
Conclusions
• We considered latent variable models in different contexts:
– Topic modeling
– Structured prediction models
• We demonstrated where and why they are useful
• Reviewed basic inference/learning techniques:
– EM-type algorithms
– Variational approximations
– Sampling
• Only a very basic review
• Next time: a guest lecture by Ming-Wei Chang on Domain Adaptation (a really hot and important topic in NLP!)