Dependency Parsing by Belief Propagation. David A. Smith (JHU → UMass Amherst), Jason Eisner (Johns Hopkins University)


Dependency Parsing
by Belief Propagation
David A. Smith (JHU  UMass Amherst)
Jason Eisner (Johns Hopkins University)
1
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
2
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
3
Word Dependency Parsing

Raw sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September.

Part-of-speech tagging gives the POS-tagged sentence:
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.

Word dependency parsing then links the words with labeled dependencies such as SUBJ, MOD, SPEC, COMP, S-COMP, and ROOT:
He reckons the current account deficit will narrow to only 1.8 billion in September .

slide adapted from Yuji Matsumoto
4
What does parsing have to do
with belief propagation?
loopy belief propagation
5
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
6
Great ideas in NLP: Log-linear models
(Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)

In the beginning, we used generative models:
p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
Each choice depends on a limited part of the history.
But which dependencies to allow? p(D | A,B,C)? … or p(D | A,B) * p(C | A,B,D)?
What if they're all worthwhile?
7
Great ideas in NLP: Log-linear models
(Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)

In the beginning, we used generative models:
p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * …
Which dependencies to allow? (given limited training data)

Solution: Log-linear (max-entropy) modeling
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
Throw them all in!

Features may interact in arbitrary ways.
Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
8
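To make the (1/Z) * Φ(…) formula concrete, here is a minimal Python sketch (not from the talk) of a log-linear model over four binary variables: multiply overlapping factors and normalize by Z, computed here by brute-force summation. The factor values are made-up placeholders; in a trained model they would come from learned feature weights.

```python
# Minimal sketch of a log-linear model: p(A,B,C,D) = (1/Z) * product of factors.
# The factor tables below are arbitrary illustrative numbers, not learned weights.
from itertools import product

phi_A   = lambda a: [1.0, 2.0][a]
phi_BA  = lambda b, a: [[1.0, 0.5], [3.0, 1.0]][b][a]
phi_CA  = lambda c, a: [[2.0, 1.0], [1.0, 2.0]][c][a]
phi_CB  = lambda c, b: [[1.0, 1.0], [0.2, 4.0]][c][b]
phi_DAB = lambda d, a, b: 1.5 if d == (a and b) else 1.0   # factors may overlap freely

def unnormalized(a, b, c, d):
    return phi_A(a) * phi_BA(b, a) * phi_CA(c, a) * phi_CB(c, b) * phi_DAB(d, a, b)

# Z is whatever makes the distribution sum to 1 (brute force over 2^4 assignments).
Z = sum(unnormalized(*x) for x in product([0, 1], repeat=4))
p = lambda a, b, c, d: unnormalized(a, b, c, d) / Z
print(p(1, 1, 0, 1))
print(sum(p(*x) for x in product([0, 1], repeat=4)))   # 1.0 (up to rounding)
```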
How about structured outputs?

Log-linear models are great for n-way classification.
Also good for predicting sequences (e.g., tagging "find preferred tags" as v a n),
  but to allow fast dynamic programming, only use n-gram features.
Also good for dependency parsing ("…find preferred links…"),
  but to allow fast dynamic programming or MST parsing, only use single-edge features.
9
How about structured outputs?
…find preferred links…
but to allow fast dynamic programming or MST parsing, only use single-edge features
10
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?
yes, lots of green ...
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
11
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?
jasný → den ("bright day")
Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”
12
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?
jasný → den ("bright day")
jasný → N ("bright NOUN")
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
“It was a bright cold day in April and the clocks were striking thirteen”
13
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?
jasný → den ("bright day")
jasný → N ("bright NOUN")
A → N
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
“It was a bright cold day in April and the clocks were striking thirteen”
14
Edge-Factored Parsers (McDonald et al. 2005)

Is this a good edge?
jasný → den ("bright day")
jasný → N ("bright NOUN")
A → N, preceding a conjunction
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
“It was a bright cold day in April and the clocks were striking thirteen”
15
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?
not as good, lots of red ...
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
“It was a bright cold day in April and the clocks were striking thirteen”
16
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?
jasný → hodiny ("bright clocks") ... undertrained ...
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
“It was a bright cold day in April and the clocks were striking thirteen”
17
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?
jasný → hodiny ("bright clocks") ... undertrained ...
jasn → hodi ("bright clock," stems only)
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
stems: byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
18
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?
jasný → hodiny ("bright clocks") ... undertrained ...
jasn → hodi ("bright clock," stems only)
Aplural → Nsingular
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
stems: byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
19
Edge-Factored Parsers (McDonald et al. 2005)

How about this competing edge?
jasný → hodiny ("bright clocks") ... undertrained ...
jasn → hodi ("bright clock," stems only)
Aplural → Nsingular
A → N where N follows a conjunction
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
stems: byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
20
Edge-Factored Parsers (McDonald et al. 2005)

Which edge is better?

“bright day” or “bright clocks”?
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
stems: byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
21
Edge-Factored Parsers (McDonald et al. 2005)

Which edge is better?
Score of an edge e = θ · features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
Byl/V jasný/A studený/A dubnový/A den/N a/J hodiny/N odbíjely/V třináctou/C
stems: byl jasn stud dubn den a hodi odbí třin
“It was a bright cold day in April and the clocks were striking thirteen”
22
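As a concrete illustration of "score of an edge e = θ · features(e)", here is a hedged Python sketch. The feature templates and the tiny weight vector are invented for illustration; they only mimic the kinds of features shown on these slides (word pair, stem pair, tag pair, tag pair in a conjunction context), not the paper's actual feature set.

```python
# Sketch of edge-factored scoring: dot product of a sparse feature vector with weights.
def edge_features(parent, child):
    """Features of a single candidate edge parent -> child (illustrative templates)."""
    return {
        f"word:{parent['form']}->{child['form']}": 1,    # e.g. jasný -> den
        f"stem:{parent['stem']}->{child['stem']}": 1,    # e.g. jasn -> hodi
        f"tag:{parent['tag']}->{child['tag']}": 1,       # e.g. A -> N
        f"tag+conj:{parent['tag']}->{child['tag']}"
        f":{child['follows_conjunction']}": 1,           # e.g. A -> N right after "a"
    }

def edge_score(weights, parent, child):
    feats = edge_features(parent, child)
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

# Toy usage: a weight vector that likes "bright day" and dislikes "bright clocks".
weights = {"word:jasný->den": 2.0, "word:jasný->hodiny": -1.0, "tag:A->N": 0.5}
jasny = {"form": "jasný", "stem": "jasn", "tag": "A", "follows_conjunction": False}
den = {"form": "den", "stem": "den", "tag": "N", "follows_conjunction": False}
print(edge_score(weights, jasny, den))                   # 2.0 + 0.5 = 2.5
```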
Edge-Factored Parsers (McDonald et al. 2005)

Which edge is better?
Score of an edge e = θ · features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can't have both (one parent per word)
can't have all three (no cycles)
can't have both (no crossing links)
Thus, an edge may lose (or win) because of a consensus of other edges.
23
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
24
Finding Highest-Scoring Parse


Convert to context-free grammar (CFG)
Then use dynamic programming
The cat in the hat wore a stovepipe. ROOT
(figure: the same dependency tree with the graph drawing stretched vertically; each subtree is a linguistic constituent, here a noun phrase)
25
Finding Highest-Scoring Parse


Convert to context-free grammar (CFG)
Then use dynamic programming


CKY algorithm for CFG parsing is O(n³)
Unfortunately, O(n⁵) in this case
  to score the "cat ← wore" link, it's not enough to know this is an NP
  must know it's rooted at "cat"
  so expand the nonterminal set by O(n): {NPthe, NPcat, NPhat, ...}
  so CKY's "grammar constant" is no longer constant
(figure: the same dependency tree; each subtree is a linguistic constituent, here a noun phrase)
26
Finding Highest-Scoring Parse


Convert to context-free grammar (CFG)
Then use dynamic programming



CKY algorithm for CFG parsing is O(n³)
Unfortunately, O(n⁵) in this case
Solution: Use a different decomposition (Eisner 1996)
  Back to O(n³)
(figure: the same dependency tree; each subtree is a linguistic constituent, here a noun phrase)
27
Spans vs. constituents
Two kinds of substring.
» Constituent of the tree: links to the rest only through its headword (root).
The cat in the hat wore a stovepipe. ROOT
» Span of the tree: links to the rest only through its endwords.
The cat in the hat wore a stovepipe. ROOT
28
Decomposing a tree into spans
The cat in the hat wore a stovepipe. ROOT
= The cat + cat in the hat wore a stovepipe. ROOT
cat in the hat wore a stovepipe. ROOT = cat in the hat wore + wore a stovepipe. ROOT
cat in the hat wore = cat in + in the hat wore
in the hat wore = in the hat + hat wore
Finding Highest-Scoring Parse


Convert to context-free grammar (CFG)
Then use dynamic programming
  CKY algorithm for CFG parsing is O(n³)
  Unfortunately, O(n⁵) in this case
  Solution: Use a different decomposition (Eisner 1996)
    Back to O(n³)
Can play usual tricks for dynamic programming parsing
  Further refining the constituents or spans
    Allow prob. model to keep track of even more internal information
  A*, best-first, coarse-to-fine
  Training by EM etc. requires "outside" probabilities of constituents, spans, or links
30
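For readers who want to see the O(n³) decomposition in code, here is a small Python sketch of Eisner-style projective parsing over edge scores. It returns only the best score (backpointers would recover the tree), and the index conventions (0 = ROOT, C/I tables, direction flags) are mine, not the authors'.

```python
# Sketch of Eisner (1996)-style O(n^3) projective parsing over edge scores.
# score[h][m] is the score of the edge from head h to modifier m; 0 is the ROOT wall.
import numpy as np

def eisner_best_score(score):
    n = score.shape[0] - 1                       # number of words
    # C[i,j,d]: best complete span, I[i,j,d]: best incomplete span (the arc i-j).
    # Direction d = 1 means the head is on the left (i), d = 0 means it is on the right (j).
    C = np.full((n + 1, n + 1, 2), -np.inf)
    I = np.full((n + 1, n + 1, 2), -np.inf)
    for i in range(n + 1):
        C[i, i, 0] = C[i, i, 1] = 0.0
    for width in range(1, n + 1):
        for i in range(0, n + 1 - width):
            j = i + width
            # Attach: build the arc between i and j from two smaller complete spans.
            best = max(C[i, k, 1] + C[k + 1, j, 0] for k in range(i, j))
            I[i, j, 0] = best + score[j, i]      # j is the head of i
            I[i, j, 1] = best + score[i, j]      # i is the head of j
            # Complete: absorb an already-built arc on one side.
            C[i, j, 0] = max(C[i, k, 0] + I[k, j, 0] for k in range(i, j))
            C[i, j, 1] = max(I[i, k, 1] + C[k, j, 1] for k in range(i + 1, j + 1))
    return C[0, n, 1]                            # best projective parse hanging off ROOT

# Toy check on "ROOT the cat slept": best tree is ROOT->slept, slept->cat, cat->the.
s = np.full((4, 4), -1.0)
s[0, 3], s[3, 2], s[2, 1] = 5.0, 4.0, 3.0
print(eisner_best_score(s))                      # 12.0
```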
Hard Constraints on Valid Trees

Score of an edge e = θ · features(e), where θ is our current weight vector
Standard algos → valid parse with max total score
can't have both (one parent per word)
can't have all three (no cycles)
can't have both (no crossing links)
Thus, an edge may lose (or win) because of a consensus of other edges.
31
Non-Projective Parses
ROOT I 'll give a talk tomorrow on bootstrapping
The subtree rooted at "talk" is a discontiguous noun phrase.
can't have both (no crossing links)
The "projectivity" restriction. Do we really want it?
32
Non-Projective Parses
ROOT I 'll give a talk tomorrow on bootstrapping
occasional non-projectivity in English

ROOT ista meam norit gloria canitiem
that.NOM my.ACC may-know glory.NOM going-gray.ACC
"That glory may-know my going-gray" (i.e., it shall last till I go gray)
frequent non-projectivity in Latin, etc.
33
Finding highest-scoring non-projective tree

Consider the sentence "John saw Mary" (left).
The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective.
Can be found in time O(n²).
(figure: a weighted directed graph over root, John, saw, and Mary, and its maximum-weight spanning tree)
Every node selects its best parent; if there are cycles, contract them and repeat.
slide thanks to Dragomir Radev
34
Summing over all non-projective trees
Finding highest-scoring non-projective tree

Consider the sentence "John saw Mary" (left).
The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right), which may be non-projective.
Can be found in time O(n²).
How about the total weight Z of all trees?
How about outside probabilities or gradients?
Can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).
slide thanks to Dragomir Radev
35
Graph Theory to the Rescue!
O(n³) time!
Tutte's Matrix-Tree Theorem (1948):
The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, with row and column r removed, is equal to the sum of scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need!
36
Building the Kirchhoff (Laplacian) Matrix

    K = [ Σ_{j≠1} s(1,j)      −s(1,2)      ⋯      −s(1,n)     ]
        [   −s(2,1)       Σ_{j≠2} s(2,j)   ⋯      −s(2,n)     ]
        [      ⋮               ⋮           ⋱         ⋮        ]
        [   −s(n,1)          −s(n,2)       ⋯   Σ_{j≠n} s(n,j) ]

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

N.B.: This allows multiple children of root, but see Koo et al. 2007.
37
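The following sketch shows the Matrix-Tree computation in code under one concrete convention (mine, not necessarily the slide's): node 0 is the root and s[h, m] > 0 is the multiplicative score (e.g., the exponentiated log-linear score) of the edge from head h to modifier m.

```python
# Sketch of the Matrix-Tree computation of Z, the total weight of all
# non-projective dependency trees rooted at node 0.
import numpy as np

def partition_function(s):
    """Sum over all trees rooted at 0 of the product of their edge scores."""
    n = s.shape[0] - 1                 # words are nodes 1..n
    L = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):          # each modifier picks exactly one head
        for h in range(n + 1):
            if h == m:
                continue
            L[m, m] += s[h, m]         # weighted in-degree on the diagonal
            if h != 0:
                L[h, m] -= s[h, m]     # negated edge scores off the diagonal
    # Strike the root row and column, then take the determinant.
    return np.linalg.det(L[1:, 1:])

# Tiny sanity check: with all edge scores equal to 1, Z simply counts the trees.
s = np.ones((3, 3))
print(partition_function(s))           # 3.0: three spanning trees over two words
```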
Why Should This Work?
Clear for the 1×1 matrix; use induction.

    K = [ Σ_{j≠1} s(1,j)      −s(1,2)      ⋯      −s(1,n)     ]
        [   −s(2,1)       Σ_{j≠2} s(2,j)   ⋯      −s(2,n)     ]
        [   −s(n,1)          −s(n,2)       ⋯   Σ_{j≠n} s(n,j) ]

Chu-Liu-Edmonds analogy:
Every node selects its best parent;
if there are cycles, contract and recur.

K′ = K with the edge (1,2) contracted
K″ = K({1,2} | {1,2})
|K| = s(1,2) |K′| + |K″|

Undirected case shown; special root cases for directed.
38
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
39
Exactly Finding the Best Parse

…find preferred links… (but to allow fast dynamic programming or MST parsing, only use single-edge features)
With arbitrary features, runtime blows up:
Projective parsing: O(n³) by dynamic programming
  adding grandparents, grandparent + sibling, sibling bigrams, POS trigrams, or non-adjacent sibling pairs pushes this to O(n⁴), O(n⁵), O(n³g⁶), … O(2ⁿ)
Non-projective parsing: O(n²) by minimum spanning tree
  with any of the above features, soft penalties for crossing links, or pretty much anything else, it becomes NP-hard
40
Let's reclaim our freedom (again!)
This paper in a nutshell

Output probability is a product of local factors
Throw in any factors we want! (log-linear model)
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ…

How could we find the best parse?
  Integer linear programming (Riedel et al., 2006)
    doesn't give us probabilities when training or parsing
  MCMC
    slow to mix? high rejection rate because of the hard TREE constraint?
  Greedy hill-climbing (McDonald & Pereira 2006)
None of these exploit the tree structure of parses as the first-order methods do.
41
Let's reclaim our freedom (again!)
This paper in a nutshell

Output probability is a product of local factors (certain global factors ok too)
Throw in any factors we want! (log-linear model)
(1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ…

Let local factors negotiate via "belief propagation"
  Links (and tags) reinforce or suppress one another
  Each iteration takes total time O(n²) or O(n³)
  Each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside)
Converges to a pretty good (but approx.) global parse
42
Let's reclaim our freedom (again!)
This paper in a nutshell

Training with many features: iterative scaling
  Each weight in turn is influenced by others
  Iterate to achieve globally optimal weights
  To train the distrib. over trees, use dynamic programming to compute the normalizer Z

Decoding with many features: belief propagation (new!)
  Each variable in turn is influenced by others
  Iterate to achieve locally consistent beliefs
  To decode the distrib. over trees, use dynamic programming to compute messages
43
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
44
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
Possible tagging (i.e., assignment to remaining variables): v v v
Observed input sentence (shaded): … find preferred tags …
45
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
Another possible tagging: v a n
Observed input sentence (shaded): … find preferred tags …
46
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
"Binary" factor that measures the compatibility of 2 adjacent tags
(rows = earlier tag, columns = later tag):

        v   n   a
    v   0   2   1
    n   2   1   0
    a   0   3   1

The model reuses the same parameters at this position (… find preferred tags …)
47
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
"Unary" factor evaluates this tag; its values depend on the corresponding word
For the word "tags": v 0.2, n 0.2, a 0 (can't be adj)
48
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
"Unary" factor evaluates this tag; its values depend on the corresponding word
For the word "tags": v 0.2, n 0.2, a 0
(could be made to depend on the entire observed sentence)
49
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
"Unary" factor evaluates this tag; a different unary factor at each position:
  find: v 0.3, n 0.02, a 0
  preferred: v 0.3, n 0, a 0.1
  tags: v 0.2, n 0.2, a 0
50
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on v a n:
  binary factor on each pair of adjacent tags (rows = earlier tag, columns = later tag):
        v   n   a
    v   0   2   1
    n   2   1   0
    a   0   3   1
  unary factors: find: v 0.3, n 0.02, a 0; preferred: v 0.3, n 0, a 0.1; tags: v 0.2, n 0.2, a 0
51
Local factors in a graphical model

First, a familiar example: Conditional Random Field (CRF) for POS tagging
p(v a n) is proportional to the product of all factors' values on v a n
= … 1 * 3 * 0.3 * 0.1 * 0.2 …
(binary factor: 1 for the adjacent pair v, a and 3 for a, n; unary factors: 0.3 for find=v, 0.1 for preferred=a, 0.2 for tags=n)
52
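The product above can be reproduced directly from the toy factor tables on the preceding slides; the Python sketch below does so, reading the binary table with rows as the earlier tag and columns as the later tag (my reading of the figure) and ignoring the "…" context outside the three words.

```python
# Sketch: unnormalized score of the tagging (v, a, n) for "find preferred tags"
# as a product of the two binary factors and the three unary factors.
TAGS = ("v", "n", "a")

BINARY = {                                     # compatibility of adjacent tags
    "v": {"v": 0, "n": 2, "a": 1},
    "n": {"v": 2, "n": 1, "a": 0},
    "a": {"v": 0, "n": 3, "a": 1},
}
UNARY = {                                      # word/tag compatibility
    "find":      {"v": 0.3, "n": 0.02, "a": 0},
    "preferred": {"v": 0.3, "n": 0,    "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.2,  "a": 0},
}

def unnormalized_score(tagging, words=("find", "preferred", "tags")):
    score = 1.0
    for left, right in zip(tagging, tagging[1:]):
        score *= BINARY[left][right]
    for word, tag in zip(words, tagging):
        score *= UNARY[word][tag]
    return score

print(unnormalized_score(("v", "a", "n")))     # 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018
# p(v a n) is this value divided by Z, the sum over all 27 possible taggings.
Z = sum(unnormalized_score((t1, t2, t3)) for t1 in TAGS for t2 in TAGS for t3 in TAGS)
print(unnormalized_score(("v", "a", "n")) / Z)
```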
Local factors in a graphical model

First, a familiar example: CRF for POS tagging (v a n)
Now let's do dependency parsing!
  O(n²) boolean variables for the possible links
… find preferred links …
53
Local factors in a graphical model

First, a familiar example: CRF for POS tagging (v a n)
Now let's do dependency parsing!
  O(n²) boolean variables for the possible links
Possible parse: encoded as an assignment to these variables (here the six drawn links take values t f f f f t)
… find preferred links …
54
Local factors in a graphical model

First, a familiar example: CRF for POS tagging (v a n)
Now let's do dependency parsing!
  O(n²) boolean variables for the possible links
Possible parse: encoded as an assignment to these variables
Another possible parse (a different t/f assignment)
… find preferred links …
55
Local factors in a graphical model

First, a familiar example: CRF for POS tagging (v a n)
Now let's do dependency parsing!
  O(n²) boolean variables for the possible links
Possible parse: encoded as an assignment to these variables
Another possible parse
An illegal parse (cycle)
… find preferred links …
56
Local factors in a graphical model

First, a familiar example: CRF for POS tagging (v a n)
Now let's do dependency parsing!
  O(n²) boolean variables for the possible links
Possible parse: encoded as an assignment to these variables
Another possible parse
An illegal parse (cycle)
Another illegal parse (multiple parents)
… find preferred links …
57
Local factors for parsing

So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
  As before, the goodness of a link can depend on the entire observed input context (e.g., one link gets t 2, f 1)
  Some other links aren't as good given this input sentence (e.g., t 1, f 2; t 1, f 8; t 1, f 3; t 1, f 6; t 1, f 2)
But what if the best assignment isn't a tree??
… find preferred links …
58
Global factors for parsing

So what factors shall we multiply to define parse probability?
  Unary factors to evaluate each link in isolation
  Global TREE factor to require that the links form a legal tree
    this is a "hard constraint": the factor is either 0 or 1
    (table over the six drawn links: ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0)
… find preferred links …
59
Global factors for parsing

So what factors shall we multiply to define parse probability? (optionally require the tree to be projective: no crossing links)
  Unary factors to evaluate each link in isolation
  Global TREE factor to require that the links form a legal tree
    this is a "hard constraint": the factor is either 0 or 1
    64 entries (0/1) over the six drawn links: ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0
    the assignment shown is legal: "we're legal!"
So far, this is equivalent to edge-factored parsing (McDonald et al. 2005).
Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time.
They use combinatorial algorithms; so should we!
60
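To make the 0/1 TREE factor concrete, here is a brute-force Python sketch over a 3-word toy sentence (9 link variables rather than the 6 drawn on the slide). Enumerating all 2⁹ assignments like this is exactly what the combinatorial algorithms avoid; the code is only meant to show what the hard constraint encodes.

```python
# Brute-force sketch of the TREE factor for a toy 3-word sentence.
from itertools import product

WORDS = [1, 2, 3]                # word positions; 0 is the root
LINKS = [(h, m) for m in WORDS for h in [0] + WORDS if h != m]

def tree_factor(assignment):
    """1 if the set of 'true' links forms a legal dependency tree, else 0."""
    links = {lk for lk, on in zip(LINKS, assignment) if on}
    parents = {}
    for h, m in links:
        if m in parents:
            return 0             # multiple parents
        parents[m] = h
    if set(parents) != set(WORDS):
        return 0                 # some word has no parent
    for m in WORDS:              # walk upward; must reach the root without a cycle
        seen, node = set(), m
        while node != 0:
            if node in seen:
                return 0         # cycle
            seen.add(node)
            node = parents[node]
    return 1

legal = sum(tree_factor(a) for a in product([False, True], repeat=len(LINKS)))
print(legal)                     # 16 legal trees out of 2^9 = 512 assignments
```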
Local factors for parsing

So what factors shall we multiply to define parse probability?
  Unary factors to evaluate each link in isolation
  Global TREE factor to require that the links form a legal tree
    this is a "hard constraint": the factor is either 0 or 1
  Second-order effects: factors on 2 variables
    grandparent (e.g., factor value 3 if both links in the chain are present, 1 otherwise)
… find preferred links …
61
Local factors for parsing

So what factors shall we multiply to define parse probability?
  Unary factors to evaluate each link in isolation
  Global TREE factor to require that the links form a legal tree
    this is a "hard constraint": the factor is either 0 or 1
  Second-order effects: factors on 2 variables
    grandparent
    no-cross (e.g., factor value 0.2 if two crossing links are both present, 1 otherwise)
… find preferred links by …
62
Local factors for parsing

So what factors shall we multiply to define parse probability?
  Unary factors to evaluate each link in isolation
  Global TREE factor to require that the links form a legal tree
    this is a "hard constraint": the factor is either 0 or 1
  Second-order effects: factors on 2 variables
    grandparent
    no-cross
    siblings
    hidden POS tags
    subcategorization
    …
… find preferred links by …
63
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
64
Good to have lots of features, but …

Nice model!
Shame about the NP-hardness.
Can we approximate?
Machine learning to the rescue!
  The ML community has given a lot to NLP.
  In the 2000's, NLP has been giving back to ML
    mainly techniques for joint prediction of structures
    (much earlier, speech recognition had HMMs, EM, smoothing …)
65
Great Ideas in ML: Message Passing
Count the soldiers
there's 1 of me
(figure: a line of six soldiers; messages passed forward: 1 before you, 2 before you, 3 before you, 4 before you, 5 before you; messages passed backward: 5 behind you, 4 behind you, 3 behind you, 2 behind you, 1 behind you)
adapted from MacKay (2003) textbook
66
Great Ideas in ML: Message Passing
Count the soldiers
there's 1 of me
incoming messages: 2 before you, 3 behind you
Belief: must be 2 + 1 + 3 = 6 of us
(each soldier only sees its incoming messages)
adapted from MacKay (2003) textbook
67
Great Ideas in ML: Message Passing
Count the soldiers
there's 1 of me
incoming messages: 1 before you, 4 behind you
Belief: must be 1 + 1 + 4 = 6 of us (the same answer as 2 + 1 + 3 = 6)
(each soldier only sees its incoming messages)
adapted from MacKay (2003) textbook
68
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
3 here + 7 here + 1 of me → 11 here (= 7+3+1)
adapted from MacKay (2003) textbook
69
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
3 here + 3 here + 1 of me → 7 here (= 3+3+1)
adapted from MacKay (2003) textbook
70
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
7 here + 3 here + 1 of me → 11 here (= 7+3+1)
adapted from MacKay (2003) textbook
71
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
3 here + 7 here + 3 here + 1 of me → Belief: must be 14 of us
adapted from MacKay (2003) textbook
72
Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of the tree
3 here + 7 here + 3 here + 1 of me → Belief: must be 14 of us
adapted from MacKay (2003) textbook
73
Great ideas in ML: Forward-Backward

In the CRF, message passing = forward-backward
(figure: α messages flow rightward and β messages flow leftward along the chain "… find preferred tags …")
Example at the middle tag ("preferred"): incoming α message v 3, n 1, a 6; incoming β message v 2, n 1, a 7;
unary factor v 0.3, n 0, a 0.1; belief = their pointwise product = v 1.8, n 0, a 4.2
74
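Here is a compact Python sketch of forward-backward as message passing on the 3-word chain "find preferred tags", using the toy factor values from the earlier CRF slides. The "…" context to the left and right is ignored, so the message numbers differ from those drawn on the slide; the α/β mechanics are the point.

```python
# Forward-backward on a 3-word chain with the toy binary and unary factors above.
import numpy as np

TAGS = ["v", "n", "a"]
# Binary factor: compatibility of adjacent tags (rows = left tag, cols = right tag).
T = np.array([[0., 2., 1.],
              [2., 1., 0.],
              [0., 3., 1.]])
# Unary factors for "find", "preferred", "tags" (values for v, n, a).
U = np.array([[0.3, 0.02, 0.0],
              [0.3, 0.0,  0.1],
              [0.2, 0.2,  0.0]])

n = U.shape[0]
alpha = np.ones((n, 3))            # forward (α) messages
beta = np.ones((n, 3))             # backward (β) messages
for i in range(1, n):              # pass α to the right
    alpha[i] = (alpha[i - 1] * U[i - 1]) @ T
for i in range(n - 2, -1, -1):     # pass β to the left
    beta[i] = T @ (beta[i + 1] * U[i + 1])

beliefs = alpha * beta * U         # belief at each variable = α · β · unary
beliefs /= beliefs.sum(axis=1, keepdims=True)   # normalize to marginals
for word, b in zip(["find", "preferred", "tags"], beliefs):
    print(word, dict(zip(TAGS, b.round(3))))
```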
Great ideas in ML: Forward-Backward

Extend the CRF to a "skip chain" to capture a non-local factor
More influences on belief!
(figure: the belief at "preferred" picks up an extra incoming message v 3, n 1, a 6 from the skip-chain factor, becoming v 5.4, n 0, a 25.2)
75
Great ideas in ML: Forward-Backward

Extend the CRF to a "skip chain" to capture a non-local factor
More influences on belief!
But the red messages are not independent? The graph becomes loopy.
Pretend they are!
76
Two great tastes that taste great together

Upcoming attractions …
"You got belief propagation in my dynamic programming!"
"You got dynamic programming in my belief propagation!"
77
Loopy Belief Propagation for Parsing

Sentence tells word 3, "Please be a verb."
Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist."
The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7."
The TREE factor tells the 10 → 7 link, "You're on!"
The 10 → 7 link tells word 10, "Could you please be a noun?"
…
… find preferred links …
78
Loopy Belief Propagation for Parsing

Higher-order factors (e.g., Grandparent) induce loops
  Let's watch a loop around one triangle …
  Strong links are suppressing or promoting other links …
… find preferred links …
79
Loopy Belief Propagation for Parsing

Higher-order factors (e.g., Grandparent) induce loops
  Let's watch a loop around one triangle …
How did we compute the outgoing message to the green link?
  "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
TREE factor (table over link assignments: ffffff → 0, ffffft → 0, fffftf → 0, …, fftfft → 1, …, tttttt → 0)
… find preferred links …
80
Loopy Belief Propagation for Parsing

How did we compute the outgoing message to the green link?
  "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
But this is the outside probability of the green link!
The TREE factor computes all outgoing messages at once (given all incoming messages).
  Projective case: total O(n³) time by inside-outside
  Non-projective case: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
… find preferred links …
81
Loopy Belief Propagation for Parsing

How did we compute the outgoing message to the green link?
  "Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?"
But this is the outside probability of the green link!
The TREE factor computes all outgoing messages at once (given all incoming messages).
  Projective case: total O(n³) time by inside-outside
  Non-projective case: total O(n³) time by inverting the Kirchhoff matrix (Smith & Smith, 2007)
Belief propagation assumes incoming messages to TREE are independent.
So outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
82
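For the non-projective case, the outgoing messages of the TREE factor are edge marginals, which can be read off the inverse of the Kirchhoff matrix (Smith & Smith, 2007). The sketch below uses my conventions from the earlier Matrix-Tree sketch (node 0 is the root, s[h, m] > 0 is the score of edge h → m); it is an illustration, not the authors' code.

```python
# Edge marginals p(h -> m) under p(tree) proportional to the product of edge scores,
# computed from the inverse of the Kirchhoff (Laplacian) minor in O(n^3).
import numpy as np

def edge_marginals(s):
    n = s.shape[0] - 1
    L = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        for h in range(n + 1):
            if h == m:
                continue
            L[m, m] += s[h, m]                   # weighted in-degree on the diagonal
            if h != 0:
                L[h, m] -= s[h, m]               # negated edge scores off the diagonal
    Linv = np.linalg.inv(L[1:, 1:])              # root row/column struck, then inverted
    mu = np.zeros_like(s)
    for m in range(1, n + 1):
        mu[0, m] = s[0, m] * Linv[m - 1, m - 1]  # root edges
        for h in range(1, n + 1):
            if h != m:
                mu[h, m] = s[h, m] * (Linv[m - 1, m - 1] - Linv[m - 1, h - 1])
    return mu

s = np.ones((3, 3))                              # two words, all edge scores 1
print(edge_marginals(s).round(3))                # e.g. edge 0->1 has marginal 2/3
```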
Some connections …

Parser stacking (Nivre & McDonald 2008; Martins et al. 2008)
Global constraints in arc consistency
  ALLDIFFERENT constraint (Régin 1994)
Matching constraint in max-product BP
  for computer vision (Duchi et al., 2006)
  could be used for machine translation
As far as we know, our parser is the first use of global constraints in sum-product BP.
83
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
84
Runtimes for each factor type (see paper)

Factor type        degree   runtime   count    total
Tree               O(n²)    O(n³)     1        O(n³)
Proj. Tree         O(n²)    O(n³)     1        O(n³)
Individual links   1        O(1)      O(n²)    O(n²)
Grandparent        2        O(1)      O(n³)    O(n³)
Sibling pairs      2        O(1)      O(n³)    O(n³)
Sibling bigrams    O(n)     O(n²)     O(n)     O(n³)
NoCross            O(n)     O(n)      O(n²)    O(n³)
Tag                1        O(g)      O(n)     O(n)
TagLink            3        O(g²)     O(n²)    O(n²)
TagTrigram         O(n)     O(ng³)    1        O(ng³)
TOTAL (additive, not multiplicative!)          = O(n³) per iteration
85
Runtimes for each factor type (see paper)

(same table as on the previous slide)

Each "global" factor coordinates an unbounded # of variables.
Standard belief propagation would take exponential time to iterate over all configurations of those variables.
See paper for efficient propagators.
86
Experimental Details

Decoding
  Run several iterations of belief propagation
  Get final beliefs at link variables
  Feed them into a first-order parser
  This gives the Minimum Bayes Risk tree (minimizes expected error)
Training
  BP computes beliefs about each factor, too …
  … which gives us gradients for max conditional likelihood (as in the forward-backward algorithm)
Features used in experiments
  First-order: individual links just as in McDonald et al. 2005
  Higher-order: Grandparent, Sibling bigrams, NoCross
87
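A tiny Python sketch of the Minimum Bayes Risk decoding step: given a belief b[h, m] (approximate marginal probability) for each link, pick the legal tree that maximizes the sum of its links' beliefs, i.e. the expected number of correct links. The belief numbers are made up, and brute-force enumeration over a 3-word toy sentence stands in for the first-order parser that the slides actually feed the beliefs into.

```python
# MBR decoding by brute force on a toy sentence: choose the tree whose links have
# the highest total belief.
from itertools import product

WORDS = [1, 2, 3]
b = {(0, 1): 0.9, (0, 2): 0.3, (0, 3): 0.2,     # illustrative link beliefs
     (1, 2): 0.6, (1, 3): 0.7, (2, 1): 0.1,
     (2, 3): 0.3, (3, 1): 0.05, (3, 2): 0.4}

def is_tree(parents):                   # parents[m] = chosen head of word m
    for m in WORDS:
        seen, node = set(), m
        while node != 0:
            if node in seen:
                return False            # cycle, so not attached to the root
            seen.add(node)
            node = parents[node]
    return True

best = max(
    (dict(zip(WORDS, heads)) for heads in product([0] + WORDS, repeat=len(WORDS))
     if all(h != m for h, m in zip(heads, WORDS)) and is_tree(dict(zip(WORDS, heads)))),
    key=lambda p: sum(b[(p[m], m)] for m in WORDS),
)
print(best)   # {1: 0, 2: 1, 3: 1}: word 1 attaches to the root, words 2 and 3 to word 1
```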
Dependency Accuracy
The extra, higher-order features help! (non-projective parsing)

               Danish   Dutch   English
Tree+Link      85.5     87.3    88.6
+NoCross       86.1     88.3    89.1
+Grandparent   86.1     88.6    89.4
+ChildSeq      86.5     88.5    90.1
88
Dependency Accuracy
The extra, higher-order features help! (non-projective parsing)

                                            Danish   Dutch   English
Tree+Link                                   85.5     87.3    88.6
+NoCross                                    86.1     88.3    89.1
+Grandparent                                86.1     88.6    89.4
+ChildSeq                                   86.5     88.5    90.1
Best projective parse with all factors
(exact, slow)                               86.0     84.5    90.2
+hill-climbing (doesn't fix enough edges)   86.1     87.6    90.2
89
Time vs. Projective Search Error
(figure: projective search error vs. runtime as the number of BP iterations increases, compared with an O(n⁴) DP and an O(n⁵) DP)
90
Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: Edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: Graphical models
  - Finding the best parse: Belief propagation
- Experiments
- Conclusions
92
Freedom Regained
This paper in a nutshell

Output probability defined as a product of local and global factors
  Throw in any factors we want! (log-linear model)
  Each factor must be fast, but they run independently
Let local factors negotiate via "belief propagation"
  Each bit of syntactic structure is influenced by others
  Some factors need combinatorial algorithms to compute messages fast
    e.g., existing parsing algorithms using dynamic programming
    (compare reranking or stacking)
  Each iteration takes total time O(n³) or even O(n²); see paper
  Converges to a pretty good (but approximate) global parse
Fast parsing for formerly intractable or slow models
Extra features of these models really do help accuracy
93
Future Opportunities

Efficiently modeling more hidden structure
  POS tags, link roles, secondary links (DAG-shaped parses)
Beyond dependencies
  Constituency parsing, traces, lattice parsing
Beyond parsing
  Alignment, translation
  Bipartite matching and network flow
  Joint decoding of parsing and other tasks (IE, MT, reasoning ...)
Beyond text
  Image tracking and retrieval
  Social networks
94
thank you
95