Weighted Deduction
as a Programming Language
Jason Eisner
co-authors on various parts of this work:
Eric Goldlust, Noah A. Smith, John Blatz, Wes Filardo, Wren Thornton
CMU and Google, May 2008
1
An Anecdote from ACL’05
– Michael Jordan
2
An Anecdote from ACL’05
“Just draw a model that actually makes sense for your problem.
Just do Gibbs sampling, if it’s fast enough.
Um, it’s only 6 lines in Matlab…”
– Michael Jordan
3
Conclusions to draw from that talk
1. Mike & his students are great.
2. Graphical models are great. (because they’re flexible)
3. Gibbs sampling is great. (because it works with nearly any graphical model)
4. Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
5
Systems are big!
Large-scale noisy data, complex models, search approximations, software engineering

NLP system        files   code (lines)   comments   lang (primary)   purpose
SRILM             308     49879          14083      C++              LM
LingPipe          502     49967          47515      Java             LM/IE
Charniak parser   259     53583          8057       C++              Parsing
Stanford parser   373     121061         24486      Java             Parsing
GenPar            986     77922          12757      C++              Parsing/MT
MOSES             305     42196          6946       Perl, C++, …     MT
GIZA++            124     16116          2575       C++              MT alignment
6
Systems are big!
Large-scale noisy data, complex models, search approximations, software engineering

Maybe a bit smaller outside NLP, but still big and carefully engineered.
And will get bigger, e.g., as machine vision systems do more scene analysis and compositional object modeling.

System     files   code    comments   lang   purpose
ProbCons   15      4442    693        C++    MSA of amino acid seqs
MUSTANG    50      7620    3524       C++    MSA of protein structures
MELISMA    44      7541    1785       C      Music analysis
Dynagraph  218     20246   4505       C++    Graph layout
7
Systems are big!
Large-scale noisy data, complex models, search approximations, software engineering

Consequences:
- Barriers to entry
  - Small number of players
  - Significant investment to be taken seriously
  - Need to know & implement the standard tricks
- Barriers to experimentation
  - Too painful to tear up and reengineer your old system, to try a cute idea of unknown payoff
- Barriers to education and sharing
  - Hard to study or combine systems
  - Potentially general techniques are described and implemented only one context at a time
8
How to spend one’s life?
Didn’t I just implement something
like this last month?
chart management / indexing
cache-conscious data structures
memory layout, file formats, integerization, …
prioritization of partial solutions (best-first, A*)
lazy k-best, forest reranking
parameter management
inside-outside formulas, gradients, …
different algorithms for training and decoding
conjugate gradient, annealing, ...
parallelization
I thought computers were supposed to automate drudgery
9
Solution
- Presumably, we ought to add another layer of abstraction. After all, this is CS.
- Hope to convince you that a substantive new layer exists.
- But what would it look like? What’s shared by many programs?
10
Can toolkits help?

NLP tool   files   code     comments   lang     purpose
HTK        111     88865    14429      C        HMM for ASR
OpenFST    150     20502    1180       C++      Weighted FSTs
TIBURON    53      13791    4353       Java     Tree transducers
AGLIB      163     58475    5853       C++      Annotation of time series
UIMA       1577    154547   110183     Java     Unstructured-data mgmt
GATE       1541    79128    42848      Java     Text engineering mgmt
NLTK       258     60661    9093       Python   NLP algs (educational)
libbow     122     42061    9198       C        IR, textcat, etc.
MALLET     559     73859    18525      Java     CRFs and classification
GRMM       90      12584    3286       Java     Graphical models add-on
11
Can toolkits help?
- Hmm, there are a lot of toolkits. And they’re big too.
- Plus, they don’t always cover what you want. Which is why people keep writing them.
- E.g., I love & use OpenFST and have learned lots from its implementation! But sometimes I also want ...
  - automata with > 2 tapes
  - infinite alphabets
  - parameter training
  - A* decoding
  - automatic integerization
  - automata defined “by policy”
  - mixed sparse/dense implementation (per state)
  - parallel execution
  - hybrid models (90% finite-state)
So what is common across toolkits?
12
The Dyna language
- A toolkit’s job is to abstract away the semantics, operations, and algorithms for a particular domain.
- In contrast, Dyna is domain-independent. (like MapReduce, Bigtable, etc.)
  - Manages data & computations that you specify.
  - Toolkits or applications can be built on top.
13
Warning


Lots more beyond this talk
See http://dyna.org
read our papers
download an earlier prototype
sign up for updates by email
wait for the totally revamped next version 
14
A Quick Sketch of Dyna
15
How you build a system (“big picture” slide)
cool model: PCFG
practical equations:
  β_x(i,k) += p(N_x → N_y N_z | N_x) · β_y(i,j) · β_z(j,k),   0 ≤ i < j < k ≤ n
  ...
pseudocode (execution order):
  for width from 2 to n
    for i from 0 to n-width
      k = i+width
      for j from i+1 to k-1
        …
tuned C++ implementation (data structures, etc.)
16
How you build a system (“big picture” slide)
The Dyna language specifies these equations. Most programs just need to compute some values from other values. Any order is ok: feed-forward! dynamic programming! message passing! (including Gibbs).
Must quickly figure out what influences what — compute the Markov blanket, compute transitions in a state machine.
17
How you build a system (“big picture” slide)
Some programs also need to update the outputs if the inputs change:
  - spreadsheets, makefiles, email readers
  - dynamic graph algorithms
  - EM and other iterative optimization
  - energy of a proposed configuration for MCMC
  - leave-one-out training of smoothing params
18
How you build a system (“big picture” slide)
Compilation strategies (we’ll come back to this) take you from the practical equations through the pseudocode (execution order) down to the tuned C++ implementation.
19
Writing equations in Dyna

int a.
a = b * c.            % a will be kept up to date if b or c changes.

b += x.
b += y.               % equivalent to b = x+y.
                      % b is a sum of two variables. Also kept up to date.

c += z(1).
c += z(2).
c += z(3).
c += z(“four”).
c += z(foo(bar,5)).
c += z(N).            % a “pattern”: the capitalized N matches anything.
                      % c is a sum of all nonzero z(…) values.
                      % At compile time, we don’t know how many!
20
More interesting use of patterns

a = b * c.                    % scalar multiplication

a(I) = b(I) * c(I).           % pointwise multiplication

a += b(I) * c(I).             % dot product; could be sparse
                              % means a = Σ_I b(I)*c(I)
                              % e.g., sparse dot product of query & document:
                              %   ... + b(“yetis”)*c(“yetis”) + b(“zebra”)*c(“zebra”) + ...

a(I,K) += b(I,J) * c(J,K).    % matrix multiplication; could be sparse
                              % means a(I,K) = Σ_J b(I,J)*c(J,K)
                              % J is free on the right-hand side, so we sum over it
21
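For concreteness, here is a minimal Python sketch (not Dyna, and not part of the talk’s system) of the “free variables on the right are summed out” semantics, using made-up sparse maps in place of Dyna terms.

# Illustrative sketch (not Dyna): the "free variables are summed out" semantics,
# with sparse maps from keys to values standing in for Dyna terms.
from collections import defaultdict

b = {"yetis": 2.0, "zebra": 3.0}          # b(I)
c = {"zebra": 10.0, "aardvark": 5.0}      # c(I)

# a += b(I) * c(I).   -- I appears only on the right, so sum over it.
a = sum(b[i] * c[i] for i in b if i in c)  # sparse dot product = 30.0

# a(I,K) += b(I,J) * c(J,K).   -- J is free on the right, so sum over J.
B = {("x", "j1"): 1.0, ("x", "j2"): 2.0}   # b(I,J)
C = {("j1", "k"): 3.0, ("j2", "k"): 4.0}   # c(J,K)
A = defaultdict(float)
for (i, j), bv in B.items():
    for (j2, k), cv in C.items():
        if j == j2:
            A[(i, k)] += bv * cv           # A[("x","k")] == 11.0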
Dyna vs. Prolog
By now you may see what we’re up to!

Prolog has Horn clauses:
  a(I,K) :- b(I,J) , c(J,K).
Dyna has “Horn equations”:
  a(I,K) += b(I,J) * c(J,K).
    % to prove a(I,K), prove a value for it — e.g., a real number, but it could be any term
    % its definition comes from other values: b*c only has a value when b and c do;
    % if no values enter into +=, then a gets no value

Like Prolog:
  - Allow nested terms
  - Syntactic sugar for lists, etc.
  - Turing-complete
Unlike Prolog:
  - Terms have values
  - Charts, not backtracking!
  - Compile → efficient C++ classes
22
Some connections and intellectual debts …
- Deductive parsing schemata (preferably weighted)
  - Goodman, Nederhof, Pereira, McAllester, Warren, Shieber, Schabes, Sikkel…
- Deductive databases (preferably with aggregation)
  - Ramakrishnan, Zukowski, Freitag, Specht, Ross, Sagiv, …
  - Query optimization
  - Usually limited to decidable fragments, e.g., Datalog
- Theorem proving
  - Theorem provers, term rewriting, etc.
  - Nonmonotonic reasoning
- Programming languages
  - Efficient Prologs (Mercury, XSB, …)
  - Probabilistic programming languages (PRISM, IBAL …)
  - Declarative networking (P2)
  - XML processing languages (XTatic, CDuce)
  - Functional logic programming (Curry, …)
  - Self-adjusting computation, adaptive memoization (Acar et al.)
Increasing interest in resurrecting declarative and logic-based system specifications.
23
Example:
CKY and Variations
24
The CKY inside algorithm in Dyna

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal          += phrase(“s”,0,sentence_length).

using namespace cky;
chart c;
c[rewrite(“s”,“np”,“vp”)] = 0.7;   // put in axioms (values not defined by the above program)
c[word(“Pierre”,0,1)] = 1;
c[sentence_length] = 30;
cin >> c;                          // get more axioms from stdin
cout << c[goal];                   // print total weight of all parses — the theorem pops out
25
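To make the semantics concrete, here is a small Python sketch (not Dyna, not generated by it) of the inside computation that those three rules describe, using a made-up toy grammar and sentence.

# Illustrative Python sketch of the inside (sum-product) computation described
# by the three Dyna rules above.  Toy grammar and sentence are made up.
from collections import defaultdict

rewrite = {("s", "np", "vp"): 0.7, ("np", "Pierre"): 0.4, ("vp", "sleeps"): 0.3}
words = ["Pierre", "sleeps"]
n = len(words)

phrase = defaultdict(float)            # phrase[(X, I, J)] = inside weight
for i, w in enumerate(words):          # phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
    for (x, *rhs), p in rewrite.items():
        if rhs == [w]:
            phrase[(x, i, i + 1)] += p

for width in range(2, n + 1):          # phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
    for i in range(0, n - width + 1):
        j = i + width
        for mid in range(i + 1, j):
            for (x, *rhs), p in rewrite.items():
                if len(rhs) == 2:
                    y, z = rhs
                    phrase[(x, i, j)] += p * phrase[(y, i, mid)] * phrase[(z, mid, j)]

goal = phrase[("s", 0, n)]             # goal += phrase("s",0,sentence_length).
print(goal)                            # 0.7 * 0.4 * 0.3 = 0.084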
Visual debugger: Browse the proof forest
[figure: a proof forest, with the desired theorem at the top and the axioms at the bottom, illustrating ambiguity, a dead end, and shared substructure (dynamic programming)]
26
27
Parameterization …

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal          += phrase(“s”,0,sentence_length).

rewrite(X,Y,Z) doesn’t have to be an atomic parameter:

urewrite(X,Y,Z) *= weight1(X,Y).
urewrite(X,Y,Z) *= weight2(X,Z).
urewrite(X,Y,Z) *= weight3(Y,Z).
urewrite(X,Same,Same) *= weight4.
urewrite(X) += urewrite(X,Y,Z).                    % normalizing constant
rewrite(X,Y,Z) = urewrite(X,Y,Z) / urewrite(X).    % normalize
28
Related algorithms in Dyna?

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal          += phrase(“s”,0,sentence_length).

- Viterbi parsing?
- Logarithmic domain?
- Lattice parsing?
- Incremental (left-to-right) parsing?
- Log-linear parsing?
- Lexicalized or synchronous parsing?
- Binarized CKY?
- Earley’s algorithm?
29
Related algorithms in Dyna?
Viterbi parsing: just change += to max=.
phrase(X,I,J) max= rewrite(X,W) * word(W,I,J).
phrase(X,I,J) max= rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal          max= phrase(“s”,0,sentence_length).
30
Related algorithms in Dyna?
Logarithmic domain: change += to log+= (or max= for Viterbi) and * to +.
phrase(X,I,J) log+= rewrite(X,W) + word(W,I,J).
phrase(X,I,J) log+= rewrite(X,Y,Z) + phrase(Y,I,Mid) + phrase(Z,Mid,J).
goal          log+= phrase(“s”,0,sentence_length).
31
Related algorithms in Dyna?
Lattice parsing: no change to the program — just use lattice states instead of integer string positions in the axioms, e.g.
  c[ word(“Pierre”, state(5), state(9)) ] = 10.2
32
Related algorithms in Dyna?
Incremental (left-to-right) parsing:
- Just add words one at a time to the chart
- Check at any time what can be derived from the words so far
- Similarly, dynamic grammars
33
Related algorithms in Dyna?
Log-linear parsing: again, no change to the Dyna program.
34
Related algorithms in Dyna?
Lexicalized or synchronous parsing: basically, just add extra arguments to the terms above.
35
Rule binarization

phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).

folding transformation: asymptotic speedup!

temp(X\Y,Mid,J) += phrase(Z,Mid,J) * rewrite(X,Y,Z).
phrase(X,I,J)   += phrase(Y,I,Mid) * temp(X\Y,Mid,J).

[figure: a Y over (I,Mid) and a Z over (Mid,J) combine into an X over (I,J), either directly or via the intermediate item X\Y over (Mid,J)]
37
Rule binarization

phrase(X,I,J) += phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z).

folding transformation: asymptotic speedup!

temp(X\Y,Mid,J) += phrase(Z,Mid,J) * rewrite(X,Y,Z).
phrase(X,I,J)   += phrase(Y,I,Mid) * temp(X\Y,Mid,J).

Σ_{Y,Z,Mid} phrase(Y,I,Mid) * phrase(Z,Mid,J) * rewrite(X,Y,Z)
   = Σ_{Y,Mid} phrase(Y,I,Mid) * ( Σ_Z phrase(Z,Mid,J) * rewrite(X,Y,Z) )

The same trick shows up in graphical models, constraint programming, and multi-way database joins.
38
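A rough numerical check of the folding idea — a sketch only, with made-up dense arrays standing in for the chart and grammar (toy sizes):

# Sketch of the folding transformation's effect, using dense arrays as a stand-in
# for the chart.  G is a made-up rewrite table; shapes are illustrative only.
import numpy as np

N, S = 20, 8                      # N nonterminals, S split points (toy sizes)
G = np.random.rand(N, N, N)       # G[x, y, z] = rewrite(X,Y,Z)
left = np.random.rand(N, S)       # left[y, mid]  = phrase(Y, I, Mid) for a fixed I
right = np.random.rand(N, S)      # right[z, mid] = phrase(Z, Mid, J) for a fixed J

# Unbinarized rule: sum over Y, Z, Mid all at once.
direct = np.einsum("xyz,ym,zm->x", G, left, right)

# Folded version: build temp(X\Y, Mid) first, then combine with phrase(Y, I, Mid).
temp = np.einsum("xyz,zm->xym", G, right)     # temp(X\Y, Mid, J)
folded = np.einsum("xym,ym->x", temp, left)   # phrase(X, I, J)

assert np.allclose(direct, folded)            # same values, cheaper nesting of loops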
Program transformations
cool model: PCFG → practical equations → pseudocode (execution order) → tuned C++ implementation (data structures, etc.)

Blatz & Eisner (FG 2007): Lots of equivalent ways to write a system of equations!
Transforming from one to another may improve efficiency.
Many parsing “tricks” can be generalized into automatic transformations that help other programs, too!
39
Earley’s algorithm in Dyna

phrase(X,I,J) += rewrite(X,W) * word(W,I,J).
phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
goal          += phrase(“s”,0,sentence_length).

magic templates transformation (as noted by Minnen 1996):

need(“s”,0) = true.
need(Nonterm,J) :- phrase(_/[Nonterm|_],_,J).
phrase(Nonterm/Needed,I,I) += need(Nonterm,I), rewrite(Nonterm,Needed).
phrase(Nonterm/Needed,I,K) += phrase(Nonterm/[W|Needed],I,J) * word(W,J,K).
phrase(Nonterm/Needed,I,K) += phrase(Nonterm/[X|Needed],I,J) * phrase(X/[],J,K).
goal += phrase(“s”/[],0,sentence_length).
41
Related algorithms in Dyna?
Epsilon symbols?   word(epsilon,I,I) = 1.
(i.e., epsilons are freely available everywhere)
42
Some examples from my lab (as of 2006, w/prototype)…
Programs are very short & easy to change! Easy to try stuff out!

- Parsing using …
  - factored dependency models (Dreyer, Smith, & Smith CONLL’06)
  - with annealed risk minimization (Smith and Eisner EMNLP’06)
  - constraints on dependency length (Eisner & Smith IWPT’05)
  - unsupervised learning of deep transformations (see Eisner EMNLP’02)
  - lexicalized algorithms (see Eisner & Satta ACL’99, etc.)
- Grammar induction using …
  - partial supervision (Dreyer & Eisner EMNLP’06)
  - structural annealing (Smith & Eisner ACL’06)
  - contrastive estimation (Smith & Eisner GIA’05)
  - deterministic annealing (Smith & Eisner ACL’04)
- Machine translation using …
  - Very large neighborhood search of permutations (Eisner & Tromble, NAACL-W’06)
  - Loosely syntax-based MT (Smith & Eisner in prep.)
  - Synchronous cross-lingual parsing (Smith & Smith EMNLP’04; see also Eisner ACL’03)
- Finite-state methods for morphology, phonology, IE, even syntax …
  - Unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06)
  - Unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05)
  - Context-based morph. disambiguation (Smith, Smith & Tromble EMNLP’05)
  - Trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …)
  - Finite-state machines with very large alphabets (see Eisner ACL’97)
  - Finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03)
- Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04)
43

Can it express everything in NLP? 
- Remember, it integrates tightly with C++, so you only have to use it where it’s helpful, and write the rest in C++. Small is beautiful.
- Of course, it is Turing complete … 
44
One Execution Strategy
(forward chaining)
45
How you build a system (“big picture” slide)
cool model: PCFG → practical equations (as before) → pseudocode (execution order) → tuned C++ implementation (data structures, etc.)
Propagate updates from right-to-left through the equations.
a.k.a. “agenda algorithm”, “forward chaining”, “bottom-up inference”, “semi-naïve bottom-up” — a general method.
46
Bottom-up inference

rules of program:
  s(I,K)  += np(I,J) * vp(J,K)
  pp(I,K) += prep(I,J) * np(J,K)

agenda of pending updates:   np(3,5) += 0.3
We updated np(3,5); what else must therefore change?
  - query vp(5,K)? against the chart:   vp(5,7) = 0.7  →  s(3,7) += 0.21;   vp(5,9) = 0.5  →  s(3,9) += 0.15
  - query prep(I,3)? against the chart: prep(2,3) = 1.0 →  pp(2,5) += 0.3
  - no more matches to this query
chart of derived items with current values:   np(3,5) = 0.1 + 0.3 = 0.4, …
If np(3,5) hadn’t been in the chart already, we would have added it.
47
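A minimal Python sketch (an assumed toy example, not Dyna’s actual engine) of this agenda-driven, semi-naïve propagation for the two rules above:

# Minimal sketch of the agenda / semi-naive forward-chaining loop for the two
# rules shown above, on made-up chart values.
from collections import defaultdict

chart = defaultdict(float)
chart.update({("vp", 5, 7): 0.7, ("vp", 5, 9): 0.5, ("prep", 2, 3): 1.0, ("np", 3, 5): 0.1})
agenda = [(("np", 3, 5), 0.3)]           # pending update: np(3,5) += 0.3

while agenda:
    (functor, i, j), delta = agenda.pop()
    chart[(functor, i, j)] += delta      # apply the update, then propagate it
    for (f2, k, l), val in list(chart.items()):
        # rule s(I,K) += np(I,J) * vp(J,K): the update was to an np, look for vp(J,K)
        if functor == "np" and f2 == "vp" and k == j:
            agenda.append((("s", i, l), delta * val))
        # rule pp(I,K) += prep(I,J) * np(J,K): the update was to an np, look for prep(I,J)
        if functor == "np" and f2 == "prep" and l == i:
            agenda.append((("pp", k, j), val * delta))

print(chart[("s", 3, 7)], chart[("s", 3, 9)], chart[("pp", 2, 5)])   # 0.21 0.15 0.3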
How you build a system (“big picture” slide)
cool model: PCFG → practical equations → pseudocode (execution order) → tuned C++ implementation (data structures, etc.)
What’s going on under the hood?
48
Compiler provides …

agenda of pending updates: an efficient priority queue
rules of program, e.g. s(I,K) += np(I,J) * vp(J,K): hard-coded pattern matching
  - copy, compare, & hash terms fast, via integerization (interning)
chart of derived items with current values:
  - automatic indexing for O(1) lookup (e.g., vp(5,K)?)
  - efficient storage of terms (given static type info): implicit storage, “symbiotic” storage, various data structures, support for indices, stack vs. heap, …
49
Beware double-counting!

rules of program:             n(I,K) += n(I,J) * n(J,K)
agenda of pending updates:    n(5,5) += 0.3
chart of derived items:       n(5,5) = 0.2   (an epsilon constituent)

The update to n(5,5) can combine with n(5,5) itself (via the query n(5,K)?) to make n(5,5) — another copy of itself. How much should we add (+= ?)?
50
More issues in implementing inference
- Handling non-distributive updates
  - Replacement:  p max= q(X).  — what if the current max q(0) is reduced?
  - Retraction:  what if q(0) becomes unprovable (no value)?
  - Non-distributive rules:  p += 1/q(X).  — adding Δ to q(0) doesn’t simply add to p
- Backpointers (hyperedges in the derivation forest)
  - Efficient storage, or on-demand recomputation
- Information flow between f(3), f(int X), f(X)
51
More issues in implementing inference
- User-defined priorities
  - priority(phrase(X,I,J)) = -(J-I).                              % CKY (narrow to wide)
  - priority(phrase(X,I,J)) = phrase(X,I,J).                       % uniform-cost
  - priority(phrase(X,I,J)) = phrase(X,I,J) + heuristic(X,I,J).    % A*
  - Can we learn a good priority function? (can be dynamic)
- User-defined parallelization
  - host(phrase(X,I,J)) = J.
  - Can we learn a host-choosing function? (can be dynamic)
- User-defined convergence tests
52
More issues in implementing inference
- Time-space tradeoffs
  - Which queries to index, and how?
  - Selective or temporary memoization
  - Can we learn a policy?
- On-demand computation (backward chaining)
  - Prioritizing subgoals; query planning
  - Safely invalidating memos
  - Mixing forward-chaining and backward-chaining
  - Can we choose a good mixed strategy?
53
Parameter training
- Maximize some objective function.
- Use Dyna to compute the function.
  - The objective function is a theorem’s value — e.g., the inside algorithm computes the likelihood of the sentence.
  - The model parameters (and input sentence) are axiom values.
- Then how do you differentiate it?
  - … for gradient ascent, conjugate gradient, etc.
  - … the gradient of the log-partition function also tells us the expected counts for EM.
- Two approaches supported:
  - Tape algorithm – remember the agenda order and run it “backwards.”
  - Program transformation – automatically derive the “outside” formulas.
54
Automatic differentiation via the gradient transform

a += b * c.   ⇒   g(b) += g(a) * c.
                  g(c) += g(a) * b.

Now g(x) denotes ∂f/∂x, f being the objective function.

Examples:
- Backprop for neural networks
- Backward algorithm for HMMs and CRFs
- Outside algorithm for PCFGs

The Dyna implementation also supports “tape”-based differentiation.
55
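A tiny Python check (not Dyna) of those adjoint rules on made-up numbers:

# Tiny reverse-mode sketch of the adjoint rules for a += b * c:
# if f depends on a, then g(b) += g(a) * c and g(c) += g(a) * b.
b, c = 3.0, 4.0
a = b * c              # forward pass:  a += b * c.
f = 2.0 * a            # some objective built on top of a

g_a = 2.0              # ∂f/∂a
g_b = g_a * c          # g(b) += g(a) * c.   -> 8.0
g_c = g_a * b          # g(c) += g(a) * b.   -> 6.0

# Check against finite differences.
eps = 1e-6
assert abs((2.0 * ((b + eps) * c) - f) / eps - g_b) < 1e-3
assert abs((2.0 * (b * (c + eps)) - f) / eps - g_c) < 1e-3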
More on Program
Transformations
56
Program transformations
- An optimizing compiler would like the freedom to radically rearrange your code.
- Easier in a declarative language than in C.
  - Don’t need to reconstruct the source program’s intended semantics.
  - Also, the source program is much shorter.
- Search problem (open): Find a good sequence of transformations (on a given workload).
57
Variable elimination via a folding transform
Undirected graphical model:
goal max= f1(A,B)*f2(A,C)*f3(A,D)*f4(C,E)*f5(D,E).
To eliminate E, join the constraints mentioning E and project E out:
  goal max= f1(A,B)*f2(A,C)*f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
59
figure thanks to Rina Dechter
Variable elimination via a folding transform
To eliminate D, join the constraints mentioning D and project D out:
  goal max= f1(A,B)*f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
60
figure thanks to Rina Dechter
Variable elimination via a folding transform
Eliminating C:
  goal max= f1(A,B)*tempC(A).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
61
figure thanks to Rina Dechter
Variable elimination via a folding transform
Eliminating B:
  goal max= tempC(A)*tempB(A).
  tempB(A) max= f1(A,B).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
62
figure thanks to Rina Dechter
Variable elimination via a folding transform
Final program:
  goal max= tempC(A)*tempB(A).
  tempB(A) max= f1(A,B).
  tempC(A) max= f2(A,C)*tempD(A,C).
  tempD(A,C) max= f3(A,D)*tempE(C,D).
  tempE(C,D) max= f4(C,E)*f5(D,E).
We could replace max= with += throughout, to compute the partition function Z.
63
figure thanks to Rina Dechter
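A Python sketch (not Dyna) of the final eliminated program, with made-up nonnegative factors over a tiny domain; brute force over all assignments agrees:

# Sketch of the eliminated program above, with max= as the aggregator.
import itertools, random

random.seed(0)
dom = [0, 1, 2]
f1 = {k: random.random() for k in itertools.product(dom, dom)}   # f1(A,B)
f2 = {k: random.random() for k in itertools.product(dom, dom)}   # f2(A,C)
f3 = {k: random.random() for k in itertools.product(dom, dom)}   # f3(A,D)
f4 = {k: random.random() for k in itertools.product(dom, dom)}   # f4(C,E)
f5 = {k: random.random() for k in itertools.product(dom, dom)}   # f5(D,E)

tempE = {(c, d): max(f4[c, e] * f5[d, e] for e in dom) for c in dom for d in dom}
tempD = {(a, c): max(f3[a, d] * tempE[c, d] for d in dom) for a in dom for c in dom}
tempC = {a: max(f2[a, c] * tempD[a, c] for c in dom) for a in dom}
tempB = {a: max(f1[a, b] for b in dom) for a in dom}
goal = max(tempC[a] * tempB[a] for a in dom)

# Brute force over all assignments gives the same value.
brute = max(f1[a, b] * f2[a, c] * f3[a, d] * f4[c, e] * f5[d, e]
            for a, b, c, d, e in itertools.product(dom, repeat=5))
assert abs(goal - brute) < 1e-12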
Grammar specialization as an unfolding transform

phrase(X,I,J) += rewrite(X,Y,Z) * phrase(Y,I,Mid) * phrase(Z,Mid,J).
rewrite(“s”,“np”,“vp”) += 0.7.

unfolding ⇒
phrase(“s”,I,J) += 0.7 * phrase(“np”,I,Mid) * phrase(“vp”,Mid,J).

term flattening ⇒
s(I,J) += 0.7 * np(I,Mid) * vp(Mid,J).
(actually handled implicitly by subtype storage declarations)
64
On-demand computation via a “magic templates” transform

a :- b, c.   ⇒   a :- magic(a), b, c.
                 magic(b) :- magic(a).
                 magic(c) :- magic(a), b.

Examples:
- Earley’s algorithm for parsing
- Left-corner filter for parsing
- On-the-fly composition of FSTs

The weighted generalization turns out to be the “generalized A*” algorithm (coarse-to-fine search).
65
Speculation transformation (generalization of folding)
- Perform some portion of the computation speculatively, before we have all the inputs we need
- Fill those inputs in later
- Examples from parsing:
  - Gap passing in categorial grammar
    - Build an S/NP (a sentence missing its direct object NP)
    - Build phrase(“np”,I,K) from a phrase(“s”,I,K) we don’t have yet (so we haven’t yet chosen a particular I, K)
  - Transform a parser so that it preprocesses the grammar
    - E.g., unary rule closure or epsilon closure
  - Transform lexical context-free parsing from O(n5) → O(n3)
    - Add left children to a constituent we don’t have yet (without committing to its width)
    - Derive the Eisner & Satta (1999) algorithm
66
A few more language details
So you’ll understand the examples …
67
Terms (generalized from Prolog)
- These are the “objects” of the language
- Primitives:
  - 3, 3.14159, “myUnicodeString”
  - user-defined primitive types
- Variables:
  - X
  - int X   [type-restricted variable; types are tree automata]
- Compound terms:
  - atom
  - atom(subterm1, subterm2, …)   e.g., f(g(h(3),X,Y), Y)
  - Adding support for keyword arguments (similar to R, but must support unification)
68
Fixpoint semantics
A Dyna program is a finite rule set that defines a partial function (“map”)
from ground terms (variable-free) to values (also terms):
  weight(feature(word=“pet”,tag=“Noun”))  ↦  2.34
  northeast(point(3,6))                   ↦  point(4,7)
  mother(“Eve”)                           ↦  (not defined)
- Map only defines values for ground terms
  - Variables (X,Y,…) let us define values for ∞ly many ground terms
- Compute a map that satisfies the equations in the program
  - Not guaranteed to halt (Dyna is Turing-complete, unlike Datalog)
  - Not guaranteed to be unique
69
Fixpoint semantics
(the same map from ground terms to values)
- Map only defines values for ground terms
- Map may accept modifications at runtime
  - Runtime input
  - Adjustments to input (dynamic algorithms)
  - Retraction (remove input), detachment (forget input but preserve output)
70
“Object-oriented” features
- Maps are terms, i.e., first-class objects
- Maps can appear as subterms or as values
  - Useful for encapsulating data and passing it around:
    fst3 = compose(fst1, fst2).   % value of fst3 is a chart
    forest = parse(sentence).
- Typed by their public interface:
    fst4->edge(Q,R) += fst3->edge(R,Q).
- Maps can be stored in files and loaded from files
  - Human-readable format (looks like a Dyna program)
  - Binary format (mimics in-memory layout)
71
Functional features: Auto-evaluation
- Terms can have values. So by default, subterms are evaluated in place.
- Arranged by a simple desugaring transformation:
    foo( X ) += 3*bar(X).            % 2 things to evaluate here: bar and *
  ⇒ foo( X ) += B is bar(X), Result is 3*B, Result.
  Each “is” pattern-matches against the chart (which conceptually contains pairs such as 49 is bar(7)).
- Possible to suppress evaluation &f(x) or force it *f(x)
  - Some contexts also suppress evaluation.
  - Variables are replaced with their bindings but not otherwise evaluated.
72
Functional features: Auto-evaluation
- Another example of the desugaring transformation:
    foo(f(X)) += 3*bar(g(X)).
  ⇒ foo( F ) += F is f(X), G is g(X), B is bar(G), Result is 3*B, Result.
73
Other handy features
- Guard condition on a rule:
    fact(0) = 1.
    fact(int N) = N > 0, N*fact(N-1).
  If X is true, then X,Y has value Y; otherwise X,Y is not provable. This restricts the applicability of the rule.
  (Note: There’s a strong type system, but it’s optional. Use it as desired for safety and efficiency, and to control the implementation.)
- = is a degenerate aggregator: like +=, but it’s an error if it tries to aggregate more than one value.
- User-defined syntactic sugar (Unicode supported):
    0! = 1.
    (int N)! = N*(N-1)! if N ≥ 1.
74
Aggregation operators

f(X) = 3.      % immutable
f(X) += 3.     % can be incremented later (by further += rules, at compile-time or runtime)
f(X) min= 3.   % can be reduced later
f(X) := 3.     % can be arbitrarily changed later
               % Pseudo-aggregator: later values (in source code) override earlier ones.
               % Can regard this as a true aggregator on (source line, value) pairs.
f(X) => 3.     % like = but can be overridden by a more specific rule
75
Aggregation operators

f(X) := 1.     % can be arbitrarily changed later
Pseudo-aggregator: later values override earlier ones.

Non-monotonic reasoning:
  flies(bird X) := true.
  flies(bird X) := penguin(X), false.   % overrides
  flies(bigbird) := false.              % also overrides

Iterative update algorithms (EM, Gibbs, BP):
  a := init_a.
  a := updated_a(b).   % will override once b is proved
  b := updated_b(a).
76
Declarations (ultimately, should be chosen automatically)
- at term level
  - lazy vs. eager computational strategies
  - memoization and flushing strategies
  - prioritization, parallelization, etc.
- at class level
  - class = an implementation of a type
    - type = some subset of the term universe
    - class specifies storage strategy
    - classes may implement overlapping types
77
Frozen variables
- Dyna map semantics concerns ground terms.
- But we want to be able to reason about non-ground terms, too:
  - Manipulate Dyna rules (which are non-ground terms)
  - Work with classes of ground terms (specified by non-ground terms)
    - Queries, memoized queries …
    - Memoization, updating, prioritization of updates, …
- So, allow ground terms that contain “frozen variables”
  - Treatment under unification is beyond the scope of this talk
    $priority(f(X)) = f(X).          % for each X
    $priority(#f(X)) = infinity.     % frozen non-ground term
78
Gensyms
79
Some More Examples
Shortest paths
n-gram smoothing
Neural nets
Arc consistency
Vector-space IR
Game trees
FST composition
Edit distance
Generalized A* parsing
80
Path-finding in Prolog

pathto(1).                            % the start of all paths
pathto(V) :- edge(U,V), pathto(U).

When is the query pathto(14) really inefficient?
[figure: a graph on nodes 1–14 arranged in a grid, with many distinct paths from node 1 to node 14]

What’s wrong with this swapped version?
pathto(V) :- pathto(U), edge(U,V).
81
Shortest paths in Dyna

Single source:
  pathto(start) min= 0.
  pathto(W) min= pathto(V) + edge(V,W).

All pairs:
  path(U,U) min= 0.
  path(U,W) min= path(U,V) + edge(V,W).

(Can change min= to += to sum over paths, e.g., PageRank.)

This hint gives Dijkstra’s algorithm (pqueue):
  $priority(pathto(V) min= Delta) = Delta.
A*:
  $priority(pathto(V) min= Delta) = Delta + heuristic(V).
  Must also declare that pathto(V) has converged as soon as it pops off the priority queue; this is true if the heuristic is admissible.
82
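A Python sketch (not Dyna) of the single-source program run with a priority queue, i.e. Dijkstra’s algorithm, on made-up edge weights:

# Sketch of the single-source shortest-path program with an agenda ordered by
# priority = Delta, i.e. Dijkstra's algorithm.  Toy edge weights are made up.
import heapq

edge = {("s", "a"): 2.0, ("s", "b"): 5.0, ("a", "b"): 1.0, ("b", "t"): 1.0}
start = "s"

pathto = {}                       # pathto(V): best distance found so far
agenda = [(0.0, start)]           # pathto(start) min= 0.
while agenda:
    d, v = heapq.heappop(agenda)
    if v in pathto:               # already converged (popped with its final value)
        continue
    pathto[v] = d
    for (u, w), cost in edge.items():
        if u == v:                # pathto(W) min= pathto(V) + edge(V,W).
            heapq.heappush(agenda, (d + cost, w))

print(pathto)                     # {'s': 0.0, 'a': 2.0, 'b': 3.0, 't': 4.0}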
Neural networks in Dyna

out(Node) = sigmoid(in(Node)).
sigmoid(X) = 1/(1+exp(-X)).

in(Node) += weight(Node,Previous)*out(Previous).
in(Node) += input(Node).

error += (out(Node)-target(Node))**2.

(input and target are only defined for a few nodes)
[figure: a small feed-forward network with inputs x1–x4, hidden nodes h1–h3, and outputs y, y′]
A recurrent neural net is ok too.
83
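A Python sketch (not Dyna) of the feed-forward computation those rules describe, on a made-up two-layer network:

# Sketch of the forward pass described by the rules above, on toy weights.
import math

weight = {("h1", "x1"): 0.5, ("h1", "x2"): -0.3, ("y", "h1"): 1.2}   # weight(Node, Previous)
inputs = {"x1": 1.0, "x2": 2.0}                                      # input(Node)
target = {"y": 1.0}                                                  # target(Node)

def sigmoid(x):                     # sigmoid(X) = 1/(1+exp(-X)).
    return 1.0 / (1.0 + math.exp(-x))

out = {}
def compute_out(node):              # out(Node) = sigmoid(in(Node)).
    if node not in out:
        total = inputs.get(node, 0.0)                        # in(Node) += input(Node).
        for (n, prev), w in weight.items():
            if n == node:
                total += w * compute_out(prev)               # in(Node) += weight(Node,Previous)*out(Previous).
        out[node] = sigmoid(total)
    return out[node]

error = sum((compute_out(n) - t) ** 2 for n, t in target.items())    # error += (out(Node)-target(Node))**2.
print(out["y"], error)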
Vector-space IR in Dyna




bestscore(Query) max= score(Query,Doc).
score(Query,Doc) +=
tf(Query,Word)*tf(Doc,Word)*idf(Word).
idf(Word) = 1/log(df(Word)).
df(Word) += 1 whenever tf(Doc,Word) > 0.
84
Weighted FST composition in Dyna (epsilon-free case)

start(A o B) = start(A) x start(B).
stop(A o B, Q x R) += stop(A, Q) & stop(B, R).
arc(A o B, Q1 x R1, Q2 x R2, In, Out)
  += arc(A, Q1, Q2, In, Match) * arc(B, R1, R2, Match, Out).

Computes the full cross-product. Use the magic templates transform to build only reachable states.
85
n-gram smoothing in Dyna

mle_prob(X,Y,Z) = count(X,Y,Z)/count(X,Y).
smoothed_prob(X,Y,Z) = λ*mle_prob(X,Y,Z) + (1-λ)*mle_prob(Y,Z).
  % for arbitrary-length contexts, could use lists

count_of_count(X,Y,count(X,Y,Z)) += 1.
  % Used for Good-Turing and Kneser-Ney smoothing.
  % E.g., count_of_count(“the”, “big”, 1) is the number of word types that appeared exactly once after “the big.”

These values all update automatically during leave-one-out jackknifing.
86
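A Python sketch (not Dyna) of the same quantities computed once from toy counts; in Dyna they would stay up to date as the counts change (e.g., during jackknifing):

# Sketch of MLE, interpolated, and count-of-count quantities from toy trigrams.
from collections import Counter

trigrams = [("the", "big", "dog"), ("the", "big", "dog"), ("the", "big", "cat")]
lam = 0.8                                         # interpolation weight λ (made up)

count3 = Counter(trigrams)                        # count(X,Y,Z)
count2 = Counter((x, y) for x, y, _ in trigrams)  # count(X,Y)
countYZ = Counter((y, z) for _, y, z in trigrams)
countY = Counter(y for _, y, _ in trigrams)

def mle3(x, y, z): return count3[x, y, z] / count2[x, y]        # mle_prob(X,Y,Z)
def mle2(y, z):    return countYZ[y, z] / countY[y]             # mle_prob(Y,Z)
def smoothed(x, y, z): return lam * mle3(x, y, z) + (1 - lam) * mle2(y, z)

# count_of_count(X,Y,N): how many word types Z occurred exactly N times after X Y.
count_of_count = Counter(((x, y), n) for (x, y, _), n in count3.items())
print(smoothed("the", "big", "cat"))              # 0.8*(1/3) + 0.2*(1/3)
print(count_of_count[("the", "big"), 1])          # 1 type ("cat") seen exactly once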
Arc consistency (= 2-consistency)
Agenda algorithm …
X, Y, Z, T :: 1..3
constraints: X # Y,  Y #= Z,  T # Z,  X #< T
  - X:3 has no support in Y, so kill it off
  - Y:1 has no support in X, so kill it off
  - Z:1 just lost its only support in Y, so kill it off
Note: These steps can occur in somewhat arbitrary order.
[figure: the constraint graph over X, Y, Z, T, each with domain {1, 2, 3}]
87
slide thanks to Rina Dechter (modified)
Arc consistency in Dyna (AC-4 algorithm)

Axioms (alternatively, could define them by rule):
  indomain(Var:Val) := …                  % define some values true
  consistent(Var:Val, Var2:Val2) := …
    % Define to be true or false if Var, Var2 are co-constrained.
    % Otherwise, leave undefined (or define as true).

For Var:Val to be kept, Val must be in-domain and also not ruled out by any Var2 that cares:
  possible(Var:Val) &= indomain(Var:Val).
  possible(Var:Val) &= supported(Var:Val, Var2).

Var2 cares if it’s co-constrained with Var:Val:
  supported(Var:Val, Var2)
    |= consistent(Var:Val, Var2:Val2) & possible(Var2:Val2).
88
Propagating bounds consistency in Dyna

E.g., suppose we have a constraint A #<= B (as well as other constraints on A). Then
  maxval(a) min= maxval(b).   % if B’s max is reduced, then A’s should be too
  minval(b) max= minval(a).   % by symmetry

Similarly, if C+D #= 10, then
  maxval(c) min= 10-minval(d).
  maxval(d) min= 10-minval(c).
  minval(c) max= 10-maxval(d).
  minval(d) max= 10-maxval(c).
89
Game-tree analysis
All values represent the total advantage to player 1 starting at this board.

% how good is Board for player 1, if it’s player 1’s move?
best(Board) max= stop(player1, Board).
best(Board) max= move(player1, Board, NewBoard) + worst(NewBoard).

% how good is Board for player 1, if it’s player 2’s move?
% (player 2 is trying to make player 1 lose: zero-sum game)
worst(Board) min= stop(player2, Board).
worst(Board) min= move(player2, Board, NewBoard) + best(NewBoard).

% How good for player 1 is the starting board?
goal = best(Board) if start(Board).
90
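A Python sketch (not Dyna) of the same minimax recursion on a made-up game tree:

# Minimax sketch.  moves[(player, board)] -> list of (next_board, payoff to player 1);
# stop[(player, board)] -> payoff if that player stops there.  All values made up.
moves = {(1, "root"): [("a", 0.0), ("b", 1.0)],
         (2, "a"): [("a1", -2.0), ("a2", 3.0)],
         (2, "b"): [("b1", 0.0)]}
stop = {(1, "a1"): 0.0, (1, "a2"): 0.0, (1, "b1"): 5.0, (2, "a"): -1.0, (2, "b"): 0.0}

def best(board):    # best(Board) max= stop(player1,Board) / moves followed by worst(...)
    vals = []
    if (1, board) in stop:
        vals.append(stop[1, board])
    for nxt, pay in moves.get((1, board), []):
        vals.append(pay + worst(nxt))
    return max(vals)

def worst(board):   # worst(Board) min= stop(player2,Board) / moves followed by best(...)
    vals = []
    if (2, board) in stop:
        vals.append(stop[2, board])
    for nxt, pay in moves.get((2, board), []):
        vals.append(pay + best(nxt))
    return min(vals)

print(best("root"))   # value of the starting board for player 1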
Edit distance between two strings
[figure: the traditional picture — several alignments of “clara” with “caca”, costing e.g. 2, 3, 4, or 9 edits depending on the alignment]
91
Edit distance in Dyna

dist([], []) = 0.
dist([X|Xs],Ys) min= dist(Xs,Ys) + delcost(X).
dist(Xs,[Y|Ys]) min= dist(Xs,Ys) + inscost(Y).
dist([X|Xs],[Y|Ys]) min= dist(Xs,Ys) + substcost(X,Y).
substcost(L,L) = 0.

result = align([“c”, “l”, “a”, “r”, “a”], [“c”, “a”, “c”, “a”]).
92
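A Python sketch (not Dyna) of the same recursion with unit costs, memoized:

# Edit-distance sketch mirroring the Dyna rules above, with unit costs.
from functools import lru_cache

def delcost(x): return 1
def inscost(y): return 1
def substcost(x, y): return 0 if x == y else 1      # substcost(L,L) = 0.

@lru_cache(maxsize=None)
def dist(xs, ys):
    if not xs and not ys:                            # dist([], []) = 0.
        return 0
    best = float("inf")
    if xs:                                           # dist([X|Xs],Ys) min= dist(Xs,Ys) + delcost(X).
        best = min(best, dist(xs[1:], ys) + delcost(xs[0]))
    if ys:                                           # dist(Xs,[Y|Ys]) min= dist(Xs,Ys) + inscost(Y).
        best = min(best, dist(xs, ys[1:]) + inscost(ys[0]))
    if xs and ys:                                    # dist([X|Xs],[Y|Ys]) min= dist(Xs,Ys) + substcost(X,Y).
        best = min(best, dist(xs[1:], ys[1:]) + substcost(xs[0], ys[0]))
    return best

print(dist("clara", "caca"))                         # 2 with unit costs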
Edit distance in Dyna on input lattices

dist(S,T) min= dist(S,T,Q,R) + Sfinal(Q) + Tfinal(R).
dist(S,T, S->start, T->start) min= 0.
dist(S,T, I2, J)  min= dist(S,T, I, J) + Sarc(I,I2,X) + delcost(X).
dist(S,T, I, J2)  min= dist(S,T, I, J) + Tarc(J,J2,Y) + inscost(Y).
dist(S,T, I2,J2)  min= dist(S,T, I, J) + Sarc(I,I2,X) + Tarc(J,J2,Y) + substcost(X,Y).
substcost(L,L) = 0.

result = dist(lattice1, lattice2).
lattice1 = { start=state(0).
             arc(state(0),state(1),“c”)=0.3.
             arc(state(1),state(2),“l”)=0. …
             final(state(5)). }
93
Generalized A* parsing (CKY)

% Get Viterbi outside probabilities.
% Isomorphic to automatic differentiation (reverse mode).
outside(goal) = 1.
outside(Body) max= outside(Head) whenever $rule(Head max= Body).
outside(phrase B) max= (*phrase A) * outside(&(A*B)).
outside(phrase A) max= outside(&(A*B)) * (*phrase B).

% Prioritize by outside estimates from a coarsened grammar.
$priority(phrase P) = (*P) * outside(coarsen(P)).
$priority(phrase P) = 1 if P==coarsen(P).   % can't coarsen any further
94
Generalized A* parsing (CKY)

% coarsen nonterminals.
coa("PluralNoun") = "Noun".
coa("Noun") = "Anything".
coa("Anything") = "Anything". …

% coarsen phrases.
coarsen(&phrase(X,I,J)) = &phrase(coa(X),I,J).

% make successively coarser grammars;
% each is an admissible estimate for the next-finer one.
coarsen(rewrite(X,Y,Z)) = rewrite(coa(X),coa(Y),coa(Z)).
coarsen(rewrite(X,Word)) = rewrite(coa(X),Word).
*coarsen(Rule) max= Rule.
  % i.e., Coarse max= Rule whenever Coarse=coarsen(Rule).
95
Lightweight information interchange?
- Easy for Dyna terms to represent:
  - XML data (Dyna types are analogous to DTDs)
  - RDF triples (semantic web)
  - Annotated corpora
  - Ontologies
  - Graphs, automata, social networks
- Also provides facilities missing from the semantic web:
  - Queries against these data
  - State generalizations (rules, defaults) using variables
  - Aggregate data and draw conclusions
  - Keep track of provenance (backpointers)
  - Keep track of confidence (weights)
- Map = deductive database in a box
  - Like a spreadsheet, but more powerful, safer to maintain, and can communicate with the outside world
96
How fast was the prototype version?


It used “one size fits all” strategies
Asymptotically optimal, but:



4 times slower than Mark Johnson’s inside-outside
4-11 times slower than Klein & Manning’s Viterbi parser
5-6x speedup not too hard to get
97
Are you going to make it faster? (yup!)
- Static analysis
- Mixed storage strategies
  - store X in an array
  - store Y in a hash
  - don’t store Z (compute on demand)
- Mixed inference strategies
- Choose strategies by
  - User declarations
  - Automatically, by execution profiling
98
Summary
- AI systems are too hard to write and modify. Need a new layer of abstraction.
- Dyna is a language for computation (no I/O).
  - Simple, powerful idea: define values from other values by weighted logic.
  - Produces classes that interface with C++, etc.
- Compiler supports many implementation strategies
  - Tries to abstract and generalize many tricks
  - Fitting a strategy to the workload is a great opportunity for learning!
- Natural fit to fine-grained parallelization
- Natural fit to web services
99
Dyna contributors!
- Prototype (available): Eric Goldlust (core compiler), Noah A. Smith (parameter training), Markus Dreyer (front-end processing), David A. Smith, Roy Tromble, Asheesh Laroia
- All-new version (under development): Nathaniel Filardo (core compiler), Wren Ng Thornton (core compiler), Jay Van Der Wall (source language parser), John Blatz (transformations and inference), Johnny Graettinger (early design), Eric Northup (early design)
- Dynasty hypergraph browser (usable): Michael Kornbluh (initial version), Gordon Woodhull (graph layout), Samuel Huang (latest version), George Shafer, Raymond Buse, Constantinos Michael
100
FIN
101
New examples of dynamic
programming in NLP
Parameterized finite-state machines
111
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[figure: an FSM with arc weights given by formulas such as a/q, a/r, b/(1-q)r, a/q*exp(t+u), a stop weight 1-s, and a pair of arcs weighted p and 1-p]
112
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
[figure: the same FSM with the formulas instantiated to numbers, e.g., a/.2, a/.3, b/.8, a/.44, .7, .1, .9]
113
Parameterized FSMs
An FSM whose arc probabilities depend on parameters: they are formulas.
Expert first: Construct the FSM (topology & parameterization).
Automatic takes over: Given training data, find parameter values that optimize the arc probs.
114
Parameterized FSMs
Knight & Graehl 1997 – transliteration:
  p(English text)
  o p(English text → English phonemes)
  o p(English phonemes → Japanese phonemes)
  o p(Japanese phonemes → Japanese text)
“/t/ and /d/ are similar …” — loosely coupled probabilities:
  /t/:/tt/   exp p+q+r   (coronal, stop, unvoiced)
  /d/:/dd/   exp p+q+s   (coronal, stop, voiced)
115
Parameterized FSMs
(the same cascade as above)
“Would like to get some of that expert knowledge in here.”
Use probabilistic regexps like (a*.7 b) +.5 (ab*.6) …
If the probabilities are variables, (a*x b) +y (ab*z) …, then the arc weights of the compiled machine are nasty formulas. (Especially after minimization!)
116
New examples of dynamic
programming in NLP
Parameterized infinite-state machines
125
Universal grammar as a parameterized
FSA over an infinite state space
126
New examples of dynamic
programming in NLP
More abuses of finite-state machines
127
Huge-alphabet FSAs for OT phonology
Gen proposes all candidates that include this input.
[figure: candidate representations built from underlying and surface tiers of C and V slots, with features such as voi (voicing) and velar marked on further tiers]
128
Huge-alphabet FSAs for OT phonology
Encode this candidate as a string: at each moment, we need to describe what’s going on on many tiers.
[figure: one candidate written out tier by tier]
129
Directional Best Paths construction
- Keep the “best” output string for each input string
- Yields a new transducer (size ≤ 3n)
  For input abc: abc, axc.  For input abd: axd.
[figure: a small transducer with states 1–7 and arcs a:a, b:b, b:x, c:c, d:d; the red arc must be allowed just if the next input is d]
130
Minimization of semiring-weighted FSAs
- New definition of the pushed weight for each state q: the weight of the shortest path from q, breaking ties alphabetically on input symbols.
- The computation is simple, well-defined, and independent of the semiring (K, ⊗).
- Breadth-first search back from the final states: compute q’s weight in O(1) time as soon as we visit q — if q reaches r by an arc a:k, then weight(q) = k ⊗ weight(r).
- The whole algorithm is linear: faster than finding the min-weight path à la Mohri.
[figure: states q and r at distance 2 from the final states, connected by an arc a:k]
131
New examples of dynamic
programming in NLP
Tree-to-tree alignment
132
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English:
“beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”
[figure: the French tree over donnent (“give”), baiser (“kiss”), un (“a”), beaucoup (“lots”), d’ (“of”), enfants (“kids”), à (“to”), Sam, and the English tree over kids, kiss, Sam, quite, often]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
A possible alignment is shown in orange.
[figure: the same trees with aligned nodes labeled Start, NP, Adv, and null]
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
A possible alignment is shown in orange.
Alignment shows how trees are generated synchronously from “little trees” ...
[figure: the same aligned trees, decomposed into elementary “little trees”]
“beaucoup d’enfants donnent un baiser à Sam” → “kids kiss Sam quite often”
New examples of dynamic
programming in NLP
Bilexical parsing in O(n3)
(with Giorgio Satta)
136
Lexicalized CKY
[ Mary ]
loves
[ [ the]
girl
[ outdoors ] ]
137
Lexicalized CKY is O(n5), not O(n3)
[figure: combining a constituent B headed at h spanning (i…j) with a constituent C headed at h’ spanning (j+1…k) to form A headed at h spanning (i…k); example phrases “... advocate / ... hug” and “visiting relatives”]
There are O(n5) combinations, versus O(n3) combinations for unlexicalized CKY.
138
Idea #1
Combine B with what C?
- must try differently-headed C’s (vary h’)
- must try different-width C’s (vary k)
Separate these!
[figure: B spans (i…j) with head h; C spans (j+1…k) with head h’; A spans (i…k)]
139
Idea #1
[figure: the old CKY way combines B (spanning i…j, head h) with C (spanning j+1…k, head h’) in one step; the new way splits the combination into two steps so that h’ and k need not be varied at the same time]
140
Idea #2
Some grammars allow an A headed at h spanning (i,k) to be assembled from a smaller A with the same head.
141
Idea #2
Combine what B and C?
- must try different midpoints (vary j)
- must try different-width C’s (vary k)
Separate these!
[figure: B spans (i…j) with head h; C spans (j+1…k) with head h’; A spans (i…k)]
142
Idea #2
[figure: the old CKY way again combines B and C in one step; the new way splits the combination at the heads h and h’, so the midpoint j and the right boundary k are handled in separate steps]
143
144
An O(n3) algorithm (with G. Satta)
[ Mary ] loves [ [ the] girl [ outdoors ] ]
145
3 parsers: log-log plot
[figure: log-log plot of parsing time vs. sentence length (10–100 words) for three parsers — NAIVE, IWPT-97, ACL-99 — each run exhaustively and with pruning; times span roughly 1 to 100000]
146
New examples of dynamic
programming in NLP
O(n)-time partial parsing
by limiting dependency length
(with Noah A. Smith)
147
Short-Dependency Preference
A word’s dependents (adjuncts, arguments) tend to fall near it in the string.
length of a dependency ≈ surface distance
[figure: an example dependency tree whose dependencies have lengths 3, 1, 1, 1]
[figure: fraction of all dependencies vs. dependency length for English, Chinese, and German — 50% of English dependencies have length 1, another 20% have length 2, 10% have length 3 ...]
Related Ideas
• Score parses based on what’s between a head and
child
(Collins, 1997; Zeman, 2004; McDonald et al., 2005)
• Assume short → faster human processing
(Church, 1980; Gibson, 1998)
• “Attach low” heuristic for PPs (English)
(Frazier, 1979; Hobbs and Bear, 1990)
• Obligatory and optional re-orderings (English)
(see paper)
Going to Extremes
Longer dependencies are less likely. What if we eliminate them completely?
[figure: the same dependency-length distributions for English, Chinese, and German]
Hard Constraints
Disallow dependencies between words of
distance > b ...
Risk: best parse contrived, or no parse at all!
Solution: allow fragments (partial parsing;
Hindle, 1990 inter alia).
Why not model the sequence of fragments?
Building a Vine SBG Parser
Grammar: generates a sequence of trees from $
Parser: recognizes sequences of trees without long dependencies
Need to modify the training data so the model is consistent with the parser.
[figure: the dependency tree of a Penn Treebank sentence, rooted at $, with each dependency labeled by its length; as the bound b is lowered from 4 to 3, 2, 1, 0, dependencies longer than b are cut and the freed subtrees become separate fragments attached directly to $]
(from the Penn Treebank)
Vine Grammar is Regular
• Even for small b, “bunches” can grow to
arbitrary size:
• But arbitrary center embedding is out:
Limiting dependency length
- Linear-time partial parsing: a finite-state model of the root sequence, with bounded dependency length within each chunk (but a chunk could be arbitrarily wide: right- or left-branching).
- Natural-language dependencies tend to be short.
- So even if you don’t have enough data to model what the heads are … you might want to keep track of where they are.
164
Limiting dependency length
- Don’t convert into an FSA!
  - Less structure sharing
  - Explosion of states for different stack configurations
  - Hard to get your parse back
165
Limiting dependency length
- Linear-time partial parsing:
  - Each piece is at most k words wide
  - No dependencies between pieces
  - Finite-state model of the sequence of pieces
  → Linear time! O(k2n)
[figure: a sentence split into pieces labeled NP, S, NP; at most one dependency connects adjacent words within a piece]
166
167
Parsing Algorithm
• Same grammar constant as Eisner and Satta
(1999)
• O(n3) → O(nb2) runtime
• Includes some overhead (low-order term)
for constructing the vine
– Reality check ... is it worth it?
F-measure & runtime of a
limited-dependency-length parser (POS seqs)
171
Precision & recall of a
limited-dependency-length parser (POS seqs)
172
New examples of dynamic
programming in NLP
Grammar induction by initially
limiting dependency length
(with Noah A. Smith)
187
Soft bias toward short dependencies
Multiply the parse probability by exp(-δS), where S is the total length of all dependencies; then renormalize the probabilities.
[figure: a δ axis from -∞ to +∞ with the MLE baseline at δ=0; at one extreme, linear structure is preferred]
189
Structural Annealing
Start with a strong bias toward short dependencies and train a model. Then repeat: relax the bias (move δ toward the MLE baseline) and retrain — until performance stops improving on a small validation dataset.
190
Grammar Induction
[bar chart, directed attachment accuracy, 20–70 range, comparing MLE, CE (deletions & transpositions), and structural annealing:
  German      50.3   63.4   70.0
  English     41.6   57.4   61.8
  Bulgarian   45.6   40.5   58.4
  Mandarin    50.1   41.1   56.4
  Turkish     48.0   58.2   62.4
  Portuguese  42.3   71.8   50.4]
Other structural biases can be annealed, too. We tried annealing on connectivity (# of fragments), and got similar results.
191
A 6/9-Accurate Parse
[figure: the Treebank dependency parse of “the gene thus can prevent a plant from fertilizing itself,” shown next to the parse found by MLE with a locality bias]
Errors of the MLE-with-locality-bias parse: a verb instead of the modal as root; a preposition misattachment; misattachment of the adverb “thus.”
These errors can look like ones made by a supervised parser in 2000!
192
Accuracy Improvements

language     random tree   Klein & Manning (2004)   Smith & Eisner (2006)   state-of-the-art, supervised
German       27.5%         50.3                     70.0                    82.6¹
English      30.3          41.6                     61.8                    90.9²
Bulgarian    30.4          45.6                     58.4                    85.9¹
Mandarin     22.6          50.1                     57.2                    84.6¹
Turkish      29.8          48.0                     62.4                    69.6¹
Portuguese   30.6          42.3                     71.8                    86.5¹

¹ CoNLL-X shared task, best system.   ² McDonald et al., 2005
193
Combining with Contrastive Estimation
 This generally gives us our best results …
194
New examples of dynamic
programming in NLP
Contrastive estimation for HMM and grammar induction
Uses lattice parsing …
(with Noah A. Smith)
195
Contrastive Estimation:
Training Log-Linear Models
on Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science /
Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu
Contrastive Estimation:
(Efficiently) Training Log-Linear
Models (of Sequences) on
Unlabeled Data
Noah A. Smith and Jason Eisner
Department of Computer Science /
Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu
Nutshell Version
unannotated text + “max ent” features + sequence models → tractable training via contrastive estimation with lattice neighborhoods

Experiments on unlabeled data:
- POS tagging: 46% error rate reduction (relative to EM)
- “Max ent” features make it possible to survive damage to the tag dictionary
- Dependency parsing: 21% attachment error reduction (relative to EM)

“Red leaves don’t hide blue jays.”
Maximum Likelihood Estimation (Supervised)
[figure: the example x = “red leaves don’t hide blue jays” with observed tags y = JJ NNS MD VB JJ NNS; training pushes probability mass p toward the observed (x, y) pair, relative to the whole space Σ* × Λ*]
Maximum Likelihood Estimation (Unsupervised)
[figure: the same sentence with unknown tags; this is what EM does — push mass toward the observed word sequence x, summing over all of its taggings, again relative to Σ* × Λ*]
Focusing Probability Mass
numerator vs. denominator
Conditional Estimation (Supervised)
[figure: numerator = the observed tagged sentence (x, y); denominator = the same words x with all possible taggings, i.e. (x) × Λ* — a different denominator!]
Objective Functions
The table compares several objectives — MLE, MLE with hidden variables, conditional likelihood, the perceptron, and contrastive estimation — along with their optimization algorithms (count & normalize*, EM*, iterative scaling, generic numerical solvers such as LMVM and L-BFGS, backprop) and, crucially, their numerators and denominators. Supervised MLE has the observed tags & words as numerator and Σ* × Λ* as denominator; MLE with hidden variables has the observed words as numerator (summing over all possible taggings) with the same denominator; conditional likelihood keeps the tags & words in the numerator but shrinks the denominator to (words) × Λ*; the perceptron contrasts the observed tags & words with the hypothesized tags & words; contrastive estimation (in this talk, trained with L-BFGS) takes the raw word sequence as numerator and leaves its denominator as the open question.
*For generative models.

This talk is about denominators ... in the unsupervised case.
A good denominator can improve accuracy and tractability.
Language Learning (Syntax)
“red leaves don’t hide blue jays” — At last! My own language learning device!
EM asks: Why did he pick that sequence for those words? Why didn’t he say “birds fly” or “dancing granola” or “wash the dishes”?
Why not instead ask: why not “leaves red ...” or “... hide don’t ...” or any other sequence of those words?
What is a syntax model supposed to explain?
Each learning hypothesis corresponds to a denominator / neighborhood.
The Job of Syntax
“Explain why each word is necessary.”
→ DEL1WORD neighborhood
red don’t hide blue jays
leaves don’t hide blue jays
red leaves hide blue jays
red leaves don’t hide blue jays
red leaves don’t hide blue
red leaves don’t blue jays
red leaves don’t hide jays
The Job of Syntax
“Explain the (local) order of the words.”
→ TRANS1 neighborhood
red don’t leaves hide blue jays
leaves red don’t hide blue jays
red leaves don’t hide blue jays
red leaves hide don’t blue jays
red leaves don’t hide jays blue
red leaves don’t blue hide jays
[figure: the model must distribute probability over every sentence in the TRANS1 neighborhood of “red leaves don’t hide blue jays” — “leaves red don’t hide blue jays”, “red don’t leaves hide blue jays”, “red leaves hide don’t blue jays”, “red leaves don’t blue hide jays”, “red leaves don’t hide jays blue” — under any tagging; the whole neighborhood (with any tagging) is encoded compactly as a lattice, built with Dyna — www.dyna.org (shameless self promotion)]
The New Modeling Imperative
A good sentence hints that a set of bad ones is nearby.
numerator / denominator (“neighborhood”)
“Make the good sentence likely, at the expense of those bad neighbors.”

This talk is about denominators ... in the unsupervised case.
A good denominator can improve accuracy and tractability.
Log-Linear Models
p(x, y) = u(x, y) / Z, where u(x, y) is the score of (x, y) and Z is the partition function.
Computing Z is undesirable!
- Conditional Estimation (Supervised): Z sums over 1 sentence (all taggings of the observed words).
- Contrastive Estimation (Unsupervised): Z sums over a few sentences.
- Otherwise, Z sums over all possible taggings of all possible sentences!
A Big Picture: Sequence Model Estimation
[figure: a map of estimation methods arranged by whether they handle unannotated data, keep the sums tractable, and allow overlapping features — generative MLE p(x, y); log-linear MLE p(x, y); generative EM p(x); log-linear EM p(x); log-linear conditional estimation p(y | x); and log-linear CE with lattice neighborhoods]
Contrastive Neighborhoods
• Guide the learner toward models that do
what syntax is supposed to do.
• Lattice representation → efficient algorithms.
There is an art
to choosing
neighborhood
functions.
Neighborhoods

neighborhood       size    lattice arcs   perturbations
DEL1WORD           n+1     O(n)           delete up to 1 word
TRANS1             n       O(n)           transpose any bigram
DELORTRANS1        O(n)    O(n)           DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE    O(n2)   O(n2)          delete any contiguous subsequence
Σ* (EM)            ∞       –              replace each word with anything
The Merialdo (1994) Task
Given unlabeled text
and a POS dictionary
(that tells all possible tags for each word type),
A form of
supervision.
learn to tag.
Trigram Tagging Model
JJ
NNS
MD
VB
JJ
NNS
red
leaves
don’t
hide
blue
jays
feature set:
tag trigrams
tag/word pairs from a POS dictionary
[figure: tagging accuracy on ambiguous words (96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions):
  CRF (supervised, ≈ log-linear) 99.5;  HMM (supervised) 97.2;
  CE neighborhoods — LENGTH 79.3, TRANS1 79.0, DELORTRANS1 78.8;
  DA (Smith & Eisner 2004) 70.0;  EM with 10× data 66.6;  EM (Merialdo 1994) 62.1;
  DEL1WORD 60.4;  DEL1SUBSEQUENCE 58.7;  random 35.1]

What if we damage the POS dictionary?
[figure: tagging accuracy on all words (96K words, 17 coarse POS tags, uninformative initializer) as the dictionary is weakened — including all words, only words from the 1st half of the corpus, only words with count ≥ 2, or only words with count ≥ 3; OOV words excluded from the dictionary can get any tag; accuracies range from about 51 to about 90 across methods and conditions]
Trigram Tagging Model + Spelling
JJ
NNS
MD
VB
JJ
NNS
red
leaves
don’t
hide
blue
jays
feature set:
tag trigrams
tag/word pairs from a POS dictionary
1- to 3-character suffixes, contains hyphen, digit
[figure: the same dictionary-damage experiment with spelling features added; accuracies on all words now reach roughly 90–92 in the best conditions]
Log-linear spelling features aided recovery ... but only with a smart neighborhood.
The model need not be finite-state.

Unsupervised Dependency Parsing
[figure: attachment accuracy for EM vs. contrastive estimation with the LENGTH and TRANS1 neighborhoods, under uninformative and clever initializers, compared to Klein & Manning (2004); accuracies range from about 23.6 to 48.7]
To Sum Up ...
Contrastive Estimation means picking your own denominator — for tractability or for accuracy (or, as in our case, for both).
Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners).
It’s a particularly good fit for log-linear models: unsupervised sequence models with max ent features, all in time for ACL 2006.