Parsing with Compositional Vector Grammars


Parsing with Compositional Vector Grammars
Socher, Bauer, Manning, Ng 2013
Problem
• How can we parse a sentence and create a dense representation of it?
  – N-grams have obvious problems, the most important being sparsity
• Can we resolve syntactic ambiguity with context? “They ate udon with forks” vs. “They ate udon with chicken”
Standard Recursive Neural Net
[Figure: a binary tree over “I like green eggs”. Word vectors (Vector(I), Vector(like), Vector(green), Vector(eggs)) are combined pairwise through the single matrix W_main into parent vectors such as Vector(I-like) and Vector((I-like)-green), and a classifier assigns a score at each node.]
Standard Recursive Neural Net
$$p_{ab} = f\left(W_{\text{main}} \begin{bmatrix} a \\ b \end{bmatrix}\right)$$

where $f(\cdot)$ is usually $\tanh(\cdot)$ or the logistic function.

In other words, stack the two word vectors, multiply through the matrix $W_{\text{main}}$, and you get a parent vector of the same dimensionality as the children $a$ and $b$.
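A minimal numpy sketch of this composition step (the dimensionality, initialization, and example vectors are illustrative assumptions, not the paper's settings):

```python
import numpy as np

d = 4                                   # hypothetical vector dimensionality
rng = np.random.default_rng(0)

# One composition matrix (d x 2d) shared by every node in the tree.
W_main = 0.01 * rng.standard_normal((d, 2 * d))

def compose(a, b):
    """Stack the two child vectors and map them to a parent of the same size."""
    return np.tanh(W_main @ np.concatenate([a, b]))

vec_I, vec_like = rng.standard_normal(d), rng.standard_normal(d)
parent = compose(vec_I, vec_like)       # Vector(I-like), shape (d,)
```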
Syntactically Untied RNN
[Figure: the same tree over “I like green eggs”, but the lower level is first parsed with a PCFG to obtain categories (I = N, like = V, green = Adj, eggs = N). Each composition then uses a weight matrix chosen by the children's categories, e.g. W^(N,V) for Vector(I-like) and W^(Adj,N) for Vector(green-eggs), and a classifier scores each node.]
Syntactically Untied RNN
$$p_{ab} = f\left(W^{(c_1, c_2)} \begin{bmatrix} a \\ b \end{bmatrix}\right)$$

The weight matrix is determined by the PCFG parse categories $c_1$ and $c_2$ of the children $a$ and $b$ (you have one matrix per category combination).
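A minimal sketch of the untying, assuming a small dictionary of category pairs (the pairs and dimensions are illustrative):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

def init_W():
    # Identity blocks plus small noise (the initialization discussed on the next slides).
    return np.hstack([np.eye(d), np.eye(d)]) + 0.01 * rng.standard_normal((d, 2 * d))

# One composition matrix per (left-category, right-category) pair from the PCFG.
W = {("N", "V"): init_W(), ("Adj", "N"): init_W()}

def compose(cat_left, a, cat_right, b):
    """Pick the matrix untied by the children's syntactic categories, then compose."""
    return np.tanh(W[(cat_left, cat_right)] @ np.concatenate([a, b]))

vec_green, vec_eggs = rng.standard_normal(d), rng.standard_normal(d)
green_eggs = compose("Adj", vec_green, "N", vec_eggs)   # Vector(green-eggs)
```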
Examples: Composition Matrices
• Notice that he initializes them with two identity matrices side by side (in the absence of other information, the parent should be roughly the average of its children); a small demonstration follows below
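A two-line check of what that initialization does (a toy example, not from the paper): with two identity blocks side by side, the pre-nonlinearity parent is just the sum of the children, i.e. an (unscaled) average.

```python
import numpy as np

d = 3
a, b = np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])

W_init = np.hstack([np.eye(d), np.eye(d)])      # [I I], shape (d, 2d)
print(W_init @ np.concatenate([a, b]))          # [4. 4. 4.] == a + b
```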
Learning the Weights
• Errors are backpropagated through structure
(Goller and Kuchler, 1996)
$$\frac{\partial E}{\partial W_{ij}} = x_i \, \big(y_j (1 - y_j)\big) \, (y_j - t_j) \quad \text{(for logistic } f\text{)}$$

where $x_i$ is the input to the node, $y_j$ its output, and $t_j$ the target.
Weight derivatives are additive across branches! (Not obvious; a good proof/explanation is in Socher, 2014)
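A compact sketch of backprop through structure with a tanh composition and no scoring layer (the tiny tree and the upstream error signal are illustrative). The key detail is the `+=` on `grad_W`: every internal node adds its contribution to the same weight gradient.

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
W = np.hstack([np.eye(d), np.eye(d)]) + 0.01 * rng.standard_normal((d, 2 * d))

def leaf(vec):
    return {"leaf": True, "vec": vec}

def node(left, right):
    vec = np.tanh(W @ np.concatenate([left["vec"], right["vec"]]))
    return {"leaf": False, "left": left, "right": right, "vec": vec}

def backprop(n, delta, grad_W):
    """delta is dE/d(n.vec); gradients are accumulated across all branches."""
    if n["leaf"]:
        return
    delta_z = delta * (1.0 - n["vec"] ** 2)              # tanh'(z) = 1 - p^2
    children = np.concatenate([n["left"]["vec"], n["right"]["vec"]])
    grad_W += np.outer(delta_z, children)                # accumulate, don't overwrite
    child_delta = W.T @ delta_z
    backprop(n["left"], child_delta[:d], grad_W)
    backprop(n["right"], child_delta[d:], grad_W)

tree = node(node(leaf(rng.standard_normal(d)), leaf(rng.standard_normal(d))),
            leaf(rng.standard_normal(d)))
grad_W = np.zeros_like(W)
backprop(tree, np.ones(d), grad_W)                       # some upstream error at the root
```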
Tricks
• Our good friend, AdaGrad (diagonal variant); see the sketch after this list:

$$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{c + \sum_{\tau=1}^{t-1} g_\tau^2}} \; g_t \quad \text{(elementwise)}$$
• Initialize matrices with identity + small random noise
• Uses Collobert and Weston (2008) word embeddings to start
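A minimal sketch of the diagonal AdaGrad update (alpha and c here are illustrative, not the paper's hyperparameters):

```python
import numpy as np

def adagrad_update(theta, grad, hist, alpha=0.01, c=1e-8):
    """Diagonal AdaGrad: per-parameter step sizes shrink with the accumulated
    squared gradient; c is a small constant for numerical stability."""
    hist += grad ** 2                         # running sum of squared gradients
    theta -= alpha * grad / np.sqrt(c + hist)
    return theta, hist

rng = np.random.default_rng(0)
theta, hist = np.zeros(5), np.zeros(5)
for _ in range(10):
    grad = rng.standard_normal(5)             # stand-in for a real backprop gradient
    theta, hist = adagrad_update(theta, grad, hist)
```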
Learning the Tree
• We want the score of the correct parse tree to be better than the score of every incorrect tree by a margin:

$$s\big(CVG(\theta, x_i, y_i)\big) \ge s\big(CVG(\theta, x_i, y)\big) + \Delta(y_i, y)$$

(Correct parse trees are given in the training set.)
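A minimal sketch of turning that constraint into a hinge-style loss for one sentence (the scores and Delta values are toy numbers; taking the max over violators rather than the sum is a design choice assumed here, not stated in the slides):

```python
import numpy as np

def margin_loss(score_gold, candidate_scores, deltas):
    """Penalize any candidate tree whose score plus its margin Delta
    beats the gold tree's score; zero loss if the gold tree wins by enough."""
    violations = candidate_scores + deltas - score_gold
    return np.maximum(0.0, violations).max()

loss = margin_loss(5.0, np.array([4.5, 3.0]), np.array([1.0, 2.0]))  # -> 0.5
```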
Finding the Best Tree (inference)
• Want to find the parse tree with the max score (which is the sum of the scores of all subtrees)
• Too expensive to try every combination
• Trick: use a non-RNN method (a PCFG with the CKY algorithm) to select the best 200 trees, then beam-search those trees with the RNN (see the sketch below)
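A minimal sketch of the rescoring idea: each candidate tree (hard-coded here; in the paper they would come from the base PCFG) is scored by summing a per-node score over its subtrees, and the highest-scoring candidate wins. This is plain reranking, a simplification of the paper's beam search.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
W = np.hstack([np.eye(d), np.eye(d)]) + 0.01 * rng.standard_normal((d, 2 * d))
v_score = rng.standard_normal(d)            # scoring vector: one scalar score per node

def score_tree(tree, embed):
    """Trees are nested tuples of words, e.g. (("I", "like"), ("green", "eggs")).
    Returns (node vector, total score summed over all subtrees)."""
    if isinstance(tree, str):
        return embed[tree], 0.0
    lvec, lscore = score_tree(tree[0], embed)
    rvec, rscore = score_tree(tree[1], embed)
    p = np.tanh(W @ np.concatenate([lvec, rvec]))
    return p, lscore + rscore + float(v_score @ p)

embed = {w: rng.standard_normal(d) for w in ["I", "like", "green", "eggs"]}
candidates = [(("I", "like"), ("green", "eggs")),
              ((("I", "like"), "green"), "eggs")]
best = max(candidates, key=lambda t: score_tree(t, embed)[1])
```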
Model Comparisons (WSJ Dataset)
[Table: F1 for parse labels on the WSJ dataset, with Socher’s model compared against other parsers.]
Analysis of Errors
Conclusions:
• Not the best model, but fast
• No hand engineered features
• Huge number of parameters (a worked example follows these conclusions):

$$d \cdot vocab + 2d \cdot d \cdot n_{comp} + d \cdot class + d$$
• Notice that Socher can’t make the standard RNN
perform better than the PCFG: there is a pattern
here. Most of the papers from this group involve
very creative modifications to the standard RNN.
(SU-RNN, RNTN, RNN+Max Pooling)
• The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network; subsequent work showed the RNTN performed better (in different situations) than the SU-RNN
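To make the parameter count concrete, a back-of-the-envelope calculation with made-up sizes (none of these numbers come from the paper):

```python
# d = embedding size, vocab = vocabulary size, n_comp = number of composition
# matrices (one per category pair), n_class = number of classifier labels.
d, vocab, n_comp, n_class = 25, 100_000, 900, 30

params = d * vocab + 2 * d * d * n_comp + d * n_class + d
print(f"{params:,}")   # 3,625,775 parameters with these toy sizes
```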