Transcript: Miloš Ercegovčević
Learning Accurate, Compact, and Interpretable Tree Annotation
Recent Advances in Parsing Technology
WS 2011/2012
Saarland University in Saarbrücken
Miloš Ercegovčević
Outline
Introduction
EM algorithm
Latent Grammars
Motivation
Learning Latent PCFG
Split-Merge Adaptation
Efficient inference with Latent Grammars
Pruning in Multilevel Coarse-to-Fine parsing
Parse Selection
Introduction : EM Algorithm
Iterative algorithm for finding MLE or MAP estimates of parameters
in statistical models
X – observed data; Z – set of latent variables
Θ – a vector of unknown parameters
Likelihood function: $L(\Theta; X, Z) = p(X, Z \mid \Theta)$
MLE of the marginal likelihood: $L(\Theta; X) = p(X \mid \Theta) = \sum_Z p(X, Z \mid \Theta)$
However, this quantity is often intractable: typically we know neither Z nor Θ.
Introduction : EM Algorithm
Find the MLE of the marginal likelihood by iteratively
applying two steps:
Expectation step (E-step):
Compute the expected log-likelihood with respect to the conditional distribution of Z given X under the current Θ:
$Q(\Theta \mid \Theta^{(t)}) = E_{Z \mid X, \Theta^{(t)}}\big[\log L(\Theta; X, Z)\big]$
Maximization step (M-step):
Find the Θ that maximizes this quantity:
$\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})$
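As a concrete illustration of the two steps, here is a minimal EM sketch for a toy mixture of two biased coins; the model, data, and names are made up for this transcript and are not part of the slides, which apply the same alternation to latent-variable PCFGs.

```python
# Toy EM: X = heads observed in groups of tosses, Z = which of two coins was used
# (latent), Theta = the two coin biases (unknown).
import numpy as np

def em_two_coins(heads, tosses, theta=(0.6, 0.5), iters=25):
    theta = np.array(theta, dtype=float)
    for _ in range(iters):
        # E-step: posterior over the latent coin for each group under the current theta
        # (uniform prior over the two coins).
        log_lik = (heads[:, None] * np.log(theta)
                   + (tosses - heads)[:, None] * np.log1p(-theta))
        post = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate each coin's bias from the expected counts.
        theta = (post * heads[:, None]).sum(axis=0) / (post * tosses[:, None]).sum(axis=0)
    return theta

heads = np.array([9, 8, 2, 1, 7])
tosses = np.array([10, 10, 10, 10, 10])
print(em_two_coins(heads, tosses))   # separates a high-bias and a low-bias coin
```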
Latent PCFG
Standard coarse Treebank Tree
Baseline for parsing: F1 72.6
Latent PCFG
Parent-annotated trees [Johnson ’98], [Klein & Manning ’03]
F1 86.3
Latent PCFG
Head-lexicalized trees [Collins ’99, Charniak ’00]
F1 88.6
Latent PCFG
Automatically clustered categories [Matsuzaki et al. ’05]: F1 86.7
Same number of subcategories for all categories
Latent PCFG
At each step, split every category in two; after 6 split rounds there are up to 64 subcategories per category.
Initialize EM with the results of the smaller grammar (a sketch of one split step follows).
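A hypothetical sketch of one split step, to make the initialization idea concrete: the rule scores of the smaller grammar are copied into the doubled subcategory space with a little random noise to break symmetry before EM resumes. The tensor layout, noise level, and function name are assumptions, not the authors' actual code.

```python
# Split every subcategory of A, B and C in two: beta has shape (nA, nB, nC) and holds
# beta(A_x -> B_y C_z); the result has shape (2*nA, 2*nB, 2*nC).
import numpy as np

def split_rule(beta, rng, noise=0.01):
    big = np.repeat(np.repeat(np.repeat(beta, 2, axis=0), 2, axis=1), 2, axis=2)
    big = big / 4.0                                      # each (y, z) pair shares its mass
                                                         # among its four new children
    big *= 1.0 + noise * (rng.random(big.shape) - 0.5)   # small noise breaks symmetry
    return big / big.sum(axis=(1, 2), keepdims=True)     # renormalize per parent subcategory

rng = np.random.default_rng(0)
beta = np.array([[[0.7, 0.1], [0.1, 0.1]]])              # 1 parent, 2 x 2 child subcategories
print(split_rule(beta, rng).shape)                       # (2, 4, 4)
```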
Learning Latent PCFG
Induce the subcategories with EM, like forward-backward for HMMs, but over trees with fixed brackets.
[Figure: forward (inside) and backward (outside) passes over the fixed tree for "He was right .", with latent node labels X1–X7.]
Learning Latent Grammar
Inside-outside probabilities:
$P_{IN}(r,t,A_x) = \sum_{y,z} \beta(A_x \to B_y C_z)\, P_{IN}(r,s,B_y)\, P_{IN}(s,t,C_z)$
$P_{OUT}(r,s,B_y) = \sum_{x,z} \beta(A_x \to B_y C_z)\, P_{OUT}(r,t,A_x)\, P_{IN}(s,t,C_z)$
$P_{OUT}(s,t,C_z) = \sum_{x,y} \beta(A_x \to B_y C_z)\, P_{OUT}(r,t,A_x)\, P_{IN}(r,s,B_y)$
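The recursions can be sketched directly over a fixed treebank tree, where only the subcategories are summed over. The Node class, tensor layout, and toy numbers below are illustrative assumptions, not the authors' implementation.

```python
# Inside/outside passes over a fixed binary tree whose nodes carry latent subcategories.
# rules[(A, B, C)] is a tensor beta[x, y, z] = beta(A_x -> B_y C_z);
# lex[(A, word)] is a vector of emission scores per subcategory.
import numpy as np

class Node:
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, children, word

def inside(node, rules, lex):
    """P_IN(node, A_x) for every subcategory x of the node's label."""
    if node.word is not None:                              # preterminal
        node.p_in = lex[(node.label, node.word)]
        return node.p_in
    left, right = node.children
    inside(left, rules, lex)
    inside(right, rules, lex)
    beta = rules[(node.label, left.label, right.label)]
    # P_IN(A_x) = sum_{y,z} beta(A_x -> B_y C_z) P_IN(B_y) P_IN(C_z)
    node.p_in = np.einsum('xyz,y,z->x', beta, left.p_in, right.p_in)
    return node.p_in

def outside(node, rules, p_out):
    """P_OUT(node, A_x); call inside() first, and pass ones for the root."""
    node.p_out = p_out
    if node.word is not None:
        return
    left, right = node.children
    beta = rules[(node.label, left.label, right.label)]
    # P_OUT(B_y) = sum_{x,z} beta(A_x -> B_y C_z) P_OUT(A_x) P_IN(C_z), and symmetrically
    outside(left, rules, np.einsum('xyz,x,z->y', beta, p_out, right.p_in))
    outside(right, rules, np.einsum('xyz,x,y->z', beta, p_out, left.p_in))

# Toy tree (S (NP He) (VP was)) with 1 S subcategory and 2 NP/VP subcategories.
rng = np.random.default_rng(0)
rules = {('S', 'NP', 'VP'): rng.random((1, 2, 2))}
lex = {('NP', 'He'): rng.random(2), ('VP', 'was'): rng.random(2)}
tree = Node('S', (Node('NP', word='He'), Node('VP', word='was')))
inside(tree, rules, lex)
outside(tree, rules, np.ones(1))
```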
Learning Latent Grammar
Expectation step (E-step):
$P((r,s,t,A_x \to B_y C_z) \mid w, T) \propto P_{OUT}(r,t,A_x)\, \beta(A_x \to B_y C_z)\, P_{IN}(r,s,B_y)\, P_{IN}(s,t,C_z)$
Maximization step (M-step):
$\beta(A_x \to B_y C_z) := \dfrac{\#\{A_x \to B_y C_z\}}{\sum_{y',z'} \#\{A_x \to B_{y'} C_{z'}\}}$
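A minimal sketch of how these two formulas look in code for a single binary node, assuming its inside/outside vectors are already available (e.g. from the sketch above); array shapes and names are illustrative.

```python
# E-step for one anchored rule: posterior over subcategory triples (x, y, z),
# normalized by the tree likelihood P(w, T), which this node's joint scores sum to.
import numpy as np

def expected_rule_counts(p_out_A, beta, p_in_B, p_in_C):
    joint = np.einsum('x,xyz,y,z->xyz', p_out_A, beta, p_in_B, p_in_C)
    return joint / joint.sum()

# M-step: beta(A_x -> B_y C_z) = #{A_x -> B_y C_z} / sum_{y',z'} #{A_x -> B_y' C_z'},
# applied to counts accumulated over all nodes and sentences.
def reestimate(counts):
    return counts / counts.sum(axis=(1, 2), keepdims=True)

counts = expected_rule_counts(np.ones(1), np.full((1, 2, 2), 0.25), np.ones(2), np.ones(2))
print(reestimate(counts))    # uniform 0.25 again, as expected for uniform inputs
```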
Latent Grammar : Adaptive splitting
Want to decide how much to split according to the data.
Solution: split everything, then merge back the splits whose loss in data likelihood is small (without loss in accuracy):
$\text{loss} = \dfrac{\text{data likelihood with split reversed}}{\text{data likelihood with split}}$
Latent Grammar : Adaptive splitting
The likelihood of the data for a tree T and sentence w can be recovered at any node from the inside/outside scores:
$P(w, T) = \sum_x P_{IN}(r,t,A_x)\, P_{OUT}(r,t,A_x)$
Then for two annotations $A_1, A_2$ the overall loss from merging them can be estimated as:
$L_{\text{ANNOTATION}}(A_1, A_2) = \prod_i \prod_{n \in T_i} \dfrac{P_n(w_i, T_i)}{P(w_i, T_i)}$
where $P_n(w_i, T_i)$ is the likelihood with the split reversed at node n only.
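A sketch of how the merge criterion can be evaluated at a single node from its inside/outside vectors; the weighting of the two subcategories by their relative frequencies (p1, p2) follows the paper's approximation, and all names here are illustrative.

```python
# Likelihood at a node, and the same likelihood with one split (x1 vs. x2) reversed.
import numpy as np

def node_likelihood(p_in, p_out):
    """P(w, T) = sum_x P_IN(A_x) P_OUT(A_x), recoverable at any node."""
    return float(np.dot(p_in, p_out))

def merged_node_likelihood(p_in, p_out, x1, x2, p1, p2):
    """P_n(w, T): likelihood if subcategories x1 and x2 are merged at this node only."""
    keep = [x for x in range(len(p_in)) if x not in (x1, x2)]
    merged_in = p1 * p_in[x1] + p2 * p_in[x2]      # inside scores average by frequency
    merged_out = p_out[x1] + p_out[x2]             # outside scores simply add
    return float(np.dot(p_in[keep], p_out[keep]) + merged_in * merged_out)

# The node's contribution to the overall loss is the ratio
#   merged_node_likelihood(...) / node_likelihood(...),
# and the merge is kept if the product of these ratios over all nodes stays close to 1.
```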
[Figure: Number of Phrasal Subcategories, a bar chart over the phrasal categories, ranging from NP, VP and PP (most subcategories) down to ROOT and LST (fewest); y-axis 0–40.]
[Figure: Number of Lexical Subcategories, a bar chart over the POS tags, ranging from NNP, JJ, NNS and NN (most subcategories) down to LS and # (fewest); y-axis 0–70.]
Latent Grammar : Results
Parser                     F1 (≤ 40 words)   F1 (all words)
Klein & Manning ’03        86.3              85.7
Matsuzaki et al. ’05       86.7              86.1
Collins ’99                88.6              88.2
Charniak & Johnson ’05     90.1              89.6
Petrov et al. ’06          90.2              89.7
Efficient inference with Latent Grammars
The latent grammar reaches 91.2 F1 on the WSJ dev set (1600 sentences).
Parsing time 1621 min: more than a minute per sentence.
This is too slow for use in real-world applications.
Improve inference with:
Hierarchical Pruning
Parse Selection
Intermediate Grammars
[Figure: the hierarchy of grammars produced during learning, from X-Bar = G0 through G1 ... G5 up to G = G6; e.g. DT is split into DT1/DT2, then DT1–DT4, and eventually DT1–DT8.]
Projected Grammars
[Figure: instead of keeping the intermediate grammars, each coarser grammar is obtained from the final grammar G = G6 by a projection πi: X-Bar = G0 = π0(G), G1 = π1(G), ..., G5 = π5(G).]
Estimating Grammars
Rules in G (split):                 Rule in π(G):
S1 → NP1 VP1   0.20
S1 → NP1 VP2   0.12
S1 → NP2 VP1   0.02
S1 → NP2 VP2   0.03                 S → NP VP   0.56
S2 → NP1 VP1   0.11
S2 → NP1 VP2   0.05
S2 → NP2 VP1   0.08
S2 → NP2 VP2   0.12
The probabilities of the projected rules are estimated not from the treebank but from the infinite tree distribution generated by G.
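A sketch of the projection step itself, assuming the expected relative frequencies of the subcategories (their weights in the tree distribution generated by G) have already been computed; the weight vector below is made up for illustration, while the rule scores are the ones from the slide.

```python
# pi(A -> B C) = sum_x P(A_x | A) * sum_{y,z} beta(A_x -> B_y C_z), with P(A_x | A)
# taken from expected subcategory counts in the trees G generates (not the treebank).
import numpy as np

def project_rule(beta_A, weights_A):
    cond = weights_A / weights_A.sum()          # P(A_x | A)
    return float(np.einsum('x,xyz->', cond, beta_A))

beta_S = np.array([[[0.20, 0.12],               # beta(S1 -> NP_y VP_z)
                    [0.02, 0.03]],
                   [[0.11, 0.05],               # beta(S2 -> NP_y VP_z)
                    [0.08, 0.12]]])
weights_S = np.array([0.7, 0.3])                # hypothetical expected frequencies of S1, S2
print(project_rule(beta_S, weights_S))          # coarse score for S -> NP VP
```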
Hierarchical Pruning
Consider a span and the categories that can cover it at each level of the hierarchy:
coarse:          … QP NP VP …
split in two:    … QP1 QP2 NP1 NP2 VP1 VP2 …
split in four:   … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
split in eight:  …
Categories whose posterior is low at a coarser level are pruned, and none of their subcategories are considered at the finer levels.
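A minimal sketch of one pruning pass between two levels of the hierarchy, under assumed chart conventions (dicts keyed by span and category); the threshold value and the coarse-to-fine mapping are illustrative.

```python
# After parsing with the coarser grammar, keep a (span, category) item at the next level
# only if its posterior under the coarse charts exceeds a threshold; every subcategory
# of a pruned coarse item is skipped by the finer pass.
def allowed_fine_items(p_in, p_out, sent_prob, refinements, threshold=1e-4):
    allowed = set()
    for (span, coarse_cat), inside_score in p_in.items():
        posterior = inside_score * p_out[(span, coarse_cat)] / sent_prob
        if posterior > threshold:
            for fine_cat in refinements[coarse_cat]:       # e.g. QP -> {QP1, QP2}
                allowed.add((span, fine_cat))
    return allowed

# Example: a kept QP over span (2, 5) licenses QP1 and QP2 over (2, 5) at the next level.
allowed = allowed_fine_items(
    p_in={((2, 5), 'QP'): 0.02}, p_out={((2, 5), 'QP'): 0.5}, sent_prob=0.03,
    refinements={'QP': ['QP1', 'QP2']})
print(allowed)
```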
Parse Selection
Given a sentence w and a split PCFG G, select the parse $T_P$ that minimizes the expected loss under our beliefs about the true tree $T_T$:
$T_P^* = \arg\min_{T_P} \sum_{T_T} P(T_T \mid w, G)\, L(T_P, T_T)$
Intractable: we cannot enumerate all true trees $T_T$.
Parse Selection
Possible solutions:
best derivation
generate n-best parses and re-rank them
sample derivations of the grammar
select the minimum-risk candidate based on a loss function over posterior rule marginals:
$q(A \to B\,C, i, k, j) = \dfrac{r(A \to B\,C, i, k, j)}{P_{IN}(\text{root}, 0, n)}$
where $r(A \to B\,C, i, k, j)$ sums the anchored rule's posterior score over all latent subcategories, and the selected tree is
$T_G = \arg\max_{T} \prod_{e \in T} q(e)$
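A sketch of the quantity behind the last option: r sums the anchored rule score over all subcategories, q normalizes it by the sentence likelihood, and a CKY-style max over the q scores then yields T_G. Array shapes and names are assumptions.

```python
# q(A -> BC, i, k, j) = r(A -> BC, i, k, j) / P_IN(root, 0, n), where r sums
# P_OUT(A_x, i, j) * beta(A_x -> B_y C_z) * P_IN(B_y, i, k) * P_IN(C_z, k, j) over x, y, z.
import numpy as np

def rule_posterior(p_out_A, beta, p_in_B, p_in_C, z_sentence):
    r = np.einsum('x,xyz,y,z->', p_out_A, beta, p_in_B, p_in_C)
    return float(r / z_sentence)

# The selected tree T_G maximizes the product of q(e) over its anchored rules e,
# which a standard Viterbi-style dynamic program over the q scores computes exactly.
```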
Results
Thank You!
References
S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning Accurate, Compact, and Interpretable Tree Annotation. COLING-ACL 2006 slides.
S. Petrov and D. Klein. Improved Inference for Unlexicalized Parsing. NAACL 2007 slides.
S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In COLING-ACL ’06, pages 433–440.
S. Petrov and D. Klein. 2007. Improved Inference for Unlexicalized Parsing. In NAACL ’07.
T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent
annotations. In ACL ’05, pages 75–82.