Learning Accurate, Compact, and Interpretable Tree Annotation
Recent Advances in Parsing Technology
WS 2011/2012
Saarland University, Saarbrücken
Miloš Ercegovčević
Outline

- Introduction
  - EM algorithm
- Latent Grammars
  - Motivation
  - Learning Latent PCFG
  - Split-Merge Adaptation
- Efficient inference with Latent Grammars
  - Pruning in Multilevel Coarse-to-Fine parsing
  - Parse Selection
Introduction : EM Algorithm

- Iterative algorithm for finding MLE or MAP estimates of parameters in statistical models
- X – observed data; Z – set of latent variables; Θ – a vector of unknown parameters
- Likelihood function:

  L(Θ; X, Z) = p(X, Z | Θ)

- MLE of the marginal likelihood:

  L(Θ; X) = p(X | Θ) = Σ_Z p(X, Z | Θ)

- However, this quantity is intractable: we typically know neither Z nor Θ
Introduction : EM Algorithm

- Find the MLE of the marginal likelihood by iteratively applying two steps:
- Expectation step (E-step): compute the expected complete-data log-likelihood under the posterior of Z given the current Θ^(t):

  Q(Θ | Θ^(t)) = E_{Z | X, Θ^(t)} [ log L(Θ; X, Z) ]

- Maximization step (M-step): find the Θ that maximizes this quantity:

  Θ^(t+1) = argmax_Θ Q(Θ | Θ^(t))
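
As a concrete illustration, here is a minimal EM sketch for a toy problem: two biased coins with unknown heads probabilities, where each observation is the number of heads in 10 tosses and the coin identity is the latent variable. The data and starting values are purely illustrative.

import numpy as np
from scipy.stats import binom  # binomial pmf for the E-step

def em_two_coins(heads, n_tosses=10, iters=50):
    p = np.array([0.6, 0.4])                     # initial guess for Θ = (p1, p2)
    for _ in range(iters):
        # E-step: posterior responsibility of each coin for each experiment,
        # i.e. the expectation over the latent Z under the current Θ.
        lik = np.stack([binom.pmf(heads, n_tosses, p[z]) for z in range(2)])
        resp = lik / lik.sum(axis=0, keepdims=True)
        # M-step: re-estimate Θ by maximizing the expected complete-data
        # log-likelihood; here that is a responsibility-weighted frequency.
        p = (resp * heads).sum(axis=1) / (resp.sum(axis=1) * n_tosses)
    return p

heads = np.array([9, 8, 2, 1, 7, 3])             # toy data
print(em_two_coins(heads))                       # converges to roughly (0.8, 0.2)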
Latent PCFG

- Standard coarse treebank tree
- Baseline parsing F1: 72.6
Latent PCFG

- Parent-annotated trees [Johnson '98], [Klein & Manning '03]
- F1: 86.3
Latent PCFG

- Head-lexicalized trees [Collins '99, Charniak '00]
- F1: 88.6
Latent PCFG

- Automatically clustered categories [Matsuzaki et al. '05], F1: 86.7
- Same number of subcategories for all categories
Latent PCFG

- At each step, split every category in two
- After 6 split iterations each category has 64 subcategories
- Initialize EM with the result of the previous, smaller grammar
Learning Latent PCFG

- Induce subcategories with EM, analogous to forward-backward for HMMs
- Brackets are fixed: the treebank tree is observed, only the subcategory labels are latent

[Diagram: tree over the sentence "He was right ." with latent subcategory labels X1–X7 above S, and a forward/backward-style pass over the fixed tree.]
Learning Latent Grammar

- Inside-Outside probabilities:

  P_IN(r, t, A_x) = Σ_{y,z} β(A_x → B_y C_z) · P_IN(r, s, B_y) · P_IN(s, t, C_z)

  P_OUT(r, s, B_y) = Σ_{x,z} β(A_x → B_y C_z) · P_OUT(r, t, A_x) · P_IN(s, t, C_z)

  P_OUT(s, t, C_z) = Σ_{x,y} β(A_x → B_y C_z) · P_OUT(r, t, A_x) · P_IN(r, s, B_y)
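
A minimal sketch of the inside recursion over a fixed bracketing, vectorized over the latent subcategories; the tree encoding and the rule_prob / lex_prob lookup tables are illustrative assumptions, not a reference implementation.

import numpy as np

# A tree node is (category, children); children is either [word] for a
# preterminal or [left_node, right_node] for a binary node.
# rule_prob[(A, B, C)] is an array of shape (n_A, n_B, n_C) holding β(A_x → B_y C_z);
# lex_prob[(A, word)] is an array of shape (n_A,).
def inside(node, rule_prob, lex_prob):
    cat, children = node
    if len(children) == 1 and isinstance(children[0], str):   # preterminal
        return lex_prob[(cat, children[0])]                   # P_IN over the A_x
    p_in_b = inside(children[0], rule_prob, lex_prob)         # P_IN(r, s, B_y)
    p_in_c = inside(children[1], rule_prob, lex_prob)         # P_IN(s, t, C_z)
    beta = rule_prob[(cat, children[0][0], children[1][0])]   # β(A_x → B_y C_z)
    # P_IN(r, t, A_x) = Σ_{y,z} β(A_x → B_y C_z) · P_IN(r, s, B_y) · P_IN(s, t, C_z)
    return np.einsum('xyz,y,z->x', beta, p_in_b, p_in_c)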
Learning Latent Grammar

- Expectation step (E-step):

  P((r, s, t, A_x → B_y C_z) | w, T) ∝ P_OUT(r, t, A_x) · β(A_x → B_y C_z) · P_IN(r, s, B_y) · P_IN(s, t, C_z)

- Maximization step (M-step):

  β(A_x → B_y C_z) := #{A_x → B_y C_z} / Σ_{y',z'} #{A_x → B_y' C_z'}
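
Continuing the sketch, the E-step posterior counts and the relative-frequency M-step could look as follows; p_out_A, p_in_B, p_in_C are assumed to come from the inside/outside recursions above, and sent_prob is the tree likelihood P(w, T).

import numpy as np

def accumulate_counts(counts, rule, p_out_A, beta, p_in_B, p_in_C, sent_prob):
    # E-step: posterior P((r, s, t, A_x → B_y C_z) | w, T) for one tree node
    post = np.einsum('x,xyz,y,z->xyz', p_out_A, beta, p_in_B, p_in_C) / sent_prob
    counts[rule] = counts.get(rule, 0.0) + post   # expected count #{A_x → B_y C_z}

def m_step(counts):
    # M-step: β(A_x → B_y C_z) := #{A_x → B_y C_z} / Σ_{y',z'} #{A_x → B_y' C_z'}
    return {rule: c / c.sum(axis=(1, 2), keepdims=True) for rule, c in counts.items()}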
Latent Grammar : Adaptive splitting

- Want to split more where the data demands it
- Solution: split everything, then merge back the splits that can be reversed with the smallest loss
- Compare the data likelihood with the split against the data likelihood with the split reversed
- Merge splits that can be undone without loss in accuracy
Latent Grammar : Adaptive splitting

- The likelihood of the data for tree T and sentence w:

  P(w, T) = Σ_x P_IN(r, t, A_x) · P_OUT(r, t, A_x)

- Then for two annotations A_1, A_2 the overall loss of merging them can be estimated as:

  Δ_ANNOTATION(A_1, A_2) = Π_i Π_{n ∈ T_i} P_n(w_i, T_i) / P(w_i, T_i)

  where P_n(w_i, T_i) is the likelihood with the split merged at node n.
[Figure: Number of Phrasal Subcategories. Bar chart over all phrasal categories (LST, ROOT, X, WHADJP, RRC, SBARQ, INTJ, WHADVP, UCP, NAC, FRAG, CONJP, SQ, WHPP, PRT, SINV, NX, PRN, WHNP, QP, SBAR, ADJP, S, ADVP, PP, VP, NP), y-axis 0–40. Frequent categories such as NP, VP and PP receive the most subcategories; rare categories such as NAC, X and LST receive very few.]
[Figure: Number of Lexical Subcategories. Bar chart over all POS tags (NNP, JJ, NNS, NN, VBN, RB, VBG, VB, VBD, CD, IN, VBZ, VBP, DT, NNPS, CC, JJR, JJS, :, PRP, PRP$, MD, RBR, WP, POS, PDT, WRB, -LRB-, EX, WP$, WDT, -RRB-, FW, RBS, TO, $, UH, ,, ``, SYM, RP, LS, #), y-axis 0–70. Open-class tags such as NNP, JJ, NNS, NN, the verb tags VBx and IN, DT receive many subcategories; closed-class tags such as POS, TO and punctuation receive very few.]
Latent Grammar : Results

Parser                    F1 (≤ 40 words)   F1 (all words)
Klein & Manning '03       86.3              85.7
Matsuzaki et al. '05      86.7              86.1
Collins '99               88.6              88.2
Charniak & Johnson '05    90.1              89.6
Petrov et al. '06         90.2              89.7
Efficient inference with Latent Grammars

- The latent grammar reaches 91.2 F1 on the WSJ dev set (1600 sentences)
- Parsing time: 1621 min, i.e. more than a minute per sentence
- For use in real-world applications this is too slow
- Improve on inference:
  - Hierarchical Pruning
  - Parse Selection
Intermediate Grammars

[Diagram: the sequence of increasingly refined grammars produced during learning, X-Bar = G0 → G1 → G2 → G3 → G4 → G5 → G6 = G; e.g. DT splits into DT1, DT2, then DT1–DT4, then DT1–DT8.]
Projected Grammars

[Diagram: instead of keeping the intermediate grammars from learning, each coarser grammar is obtained by a projection π_i of the final grammar G: X-Bar = G0 = π_0(G), G1 = π_1(G), ..., G5 = π_5(G), G6 = G.]
Estimating Grammars

Rules in G                     Rules in π(G)
S1 → NP1 VP1   0.20
S1 → NP1 VP2   0.12
S1 → NP2 VP1   0.02
S1 → NP2 VP2   0.03            S → NP VP   0.56
S2 → NP1 VP1   0.11
S2 → NP1 VP2   0.05
S2 → NP2 VP1   0.08
S2 → NP2 VP2   0.12

The projected rule probabilities are estimated from the infinite tree distribution induced by G, not from the treebank.
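
A hedged sketch of the projection step with illustrative names: the coarse rule probability is a weighted combination of the split rule probabilities, with subcategory weights that in the paper come from the infinite tree distribution and are here simply given.

from collections import defaultdict

def project_rules(fine_rules, subcat_weight, coarsen):
    """fine_rules: {(A_x, B_y, C_z): probability in the split grammar G}
    subcat_weight: {A_x: expected relative frequency of subcategory A_x}
    coarsen: maps a split symbol, e.g. 'S1', to its coarse symbol 'S'."""
    numer = defaultdict(float)
    denom = defaultdict(float)
    for (a, b, c), p in fine_rules.items():
        numer[(coarsen(a), coarsen(b), coarsen(c))] += subcat_weight[a] * p
    for a, w in subcat_weight.items():
        denom[coarsen(a)] += w
    # p(A → B C) = Σ_x w(A_x)/w(A) · Σ_{y,z} p(A_x → B_y C_z)
    return {(a, b, c): q / denom[a] for (a, b, c), q in numer.items()}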
Hierarchical Pruning

Consider one span of the sentence under grammars of increasing refinement:

- coarse:         … QP NP VP …
- split in two:   … QP1 QP2 NP1 NP2 VP1 VP2 …
- split in four:  … QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4 …
- split in eight: … and so on …

Categories whose posterior probability for the span is low under a coarser grammar are pruned, together with all their refinements, from the finer passes.
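
A schematic sketch of one pruning pass, assuming a hypothetical chart_posteriors helper that returns posterior probabilities of labeled spans under the coarser grammar:

def prune_pass(sentence, coarse_grammar, chart_posteriors, threshold=1e-4):
    allowed = set()
    for (i, j, cat), post in chart_posteriors(sentence, coarse_grammar).items():
        if post > threshold:
            # every refinement of this category stays in the finer chart
            allowed.add((i, j, cat))
    return allowed

# The finer pass then skips any item (i, j, A_x) whose coarse projection
# (i, j, A) is not in `allowed`, repeating this from G0 down to the full G.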
Parse Selection

- Given a sentence w and a split PCFG grammar G, select the parse that minimizes the expected loss under our beliefs:

  T_P* = argmin_{T_P} Σ_{T_T} P(T_T | w, G) · L(T_P, T_T)

- Intractable: we cannot enumerate all the possible true trees T_T
Parse Selection

- Possible solutions:
  - best derivation
  - generate n-best parses and re-rank them
  - sample derivations from the grammar
  - select the minimum-risk candidate based on the posterior rule marginals:

    q(A → BC, i, k, j) = r(A → BC, i, k, j) / P_IN(root, 0, n)

    T_G = argmax_T Π_{e ∈ T} q(e)
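
A hedged sketch of the last option: a CKY-style pass that maximizes the sum of log posterior rule scores q(e) instead of the derivation probability. Preterminal scores, unaries and backpointers are omitted for brevity, and q is assumed to be precomputed as above.

import math
from collections import defaultdict

def max_rule_parse(n, categories, q):
    """q maps (A, B, C, i, k, j) to the posterior rule score q(A → BC, i, k, j)."""
    best = defaultdict(lambda: float('-inf'))        # (i, j, A) -> best log score
    for i in range(n):                               # width-1 spans: score log 1 = 0
        for A in categories:                         # (lexical posteriors omitted)
            best[(i, i + 1, A)] = 0.0
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for A in categories:
                    for B in categories:
                        for C in categories:
                            s = q.get((A, B, C, i, k, j), 0.0)
                            if s <= 0.0:
                                continue
                            cand = math.log(s) + best[(i, k, B)] + best[(k, j, C)]
                            if cand > best[(i, j, A)]:
                                best[(i, j, A)] = cand
    return best   # best[(0, n, 'ROOT')] scores the max-rule-product tree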
Results
Thank You!
References

- S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning Accurate, Compact, and Interpretable Tree Annotation. COLING-ACL 2006 slides.
- S. Petrov and D. Klein. Improved Inference for Unlexicalized Parsing. NAACL 2007 slides.
- S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In COLING-ACL '06, pages 433–440.
- S. Petrov and D. Klein. 2007. Improved inference for unlexicalized parsing. In NAACL-HLT '07.
- T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent annotations. In ACL '05, pages 75–82.