Learning Approximate Inference
Policies for Fast Prediction
Jason Eisner
ICML “Inferning” Workshop
June 2012
1
Beware: Bayesians in Roadway
A Bayesian is the person who writes down the function you wish you could optimize.
[Figure: a large graphical model over linguistic variables — semantics, lexicon (word types), entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, tokens, sentences, translation, alignment, editing, quotation, discourse context, resources, speech, misspellings/typos, formatting, entanglement, annotation.]
To recover variables, model and exploit their correlations.
Motivating Tasks
• Structured prediction (e.g., for NLP problems)
  - Parsing (→ trees)
  - Machine translation (→ word strings)
  - Word variants (→ letter strings, phylogenies, grids)
• Unsupervised learning via Bayesian generative models
  - Given a few verb conjugation tables and a lot of text, find/organize/impute all verb conjugation tables of the language
  - Given some facts and a lot of text, discover more facts through information extraction and reasoning
Current Methods
• Dynamic programming
  - Exact but slow
• Approximate inference in graphical models
  - Are the approximations any good?
  - May use dynamic programming as a subroutine (structured BP)
• Sequential classification
Speed-Accuracy Tradeoffs
• Inference requires lots of computation
• Is some computation going to waste?
  - Sometimes the best prediction is overdetermined
  - Quick ad hoc methods sometimes work: how to respond?
• Is some computation actively harmful?
  - In approximate inference, passing a message can hurt
  - Frustrating to simplify the model just to fix this
• Want to keep improving our models!
  - But we need good, fast approximate inference
• Choose approximations automatically
  - Tuned to the data distribution & loss function
  - "Trainable hacks" are more robust
This talk is about "trainable hacks"
[Figure: training data flows into a prediction device (suitable for domain); the training feedback is not just likelihood but the task loss + runtime.]
Bayesian Decision Theory
[Figure: the loss function, the data distribution, and the prediction rule jointly determine the optimized parameters of the prediction rule.]
• What prediction rule? (approximate inference + beyond)
• What loss function? (can include runtime)
• How to optimize? (backprop, RL, ...)
• What data distribution? (may have to impute)
This talk is about "trainable hacks"
[Figure: a probabilistic domain model and complete training data are used to train a prediction device (suitable for domain) that maps partial data to predictions; the training feedback is loss + runtime.]
Part 1:
Your favorite approximate inference algorithm is a trainable hack
General CRFs: Unrestricted model structure
[Figure: a loopy factor graph over output variables Y1-Y4 and input variables X1-X3.]
• Add edges to model the conditional distribution well.
• But exact inference is intractable.
• So use loopy sum-product or max-product BP.
14
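To make the message-passing loop concrete, here is a minimal sketch of loopy sum-product BP on a toy cyclic pairwise model; the variables, potentials, and fixed update schedule are illustrative assumptions, not code from the talk.

```python
# Loopy sum-product BP on a tiny pairwise MRF with three binary variables
# arranged in a cycle.  We iterate the (approximate) message updates and
# then read off normalized beliefs at each variable.
import numpy as np

unary = {v: np.array([1.0, 2.0]) if v == "Y1" else np.array([1.0, 1.0])
         for v in ["Y1", "Y2", "Y3"]}
edges = [("Y1", "Y2"), ("Y2", "Y3"), ("Y3", "Y1")]
pairwise = np.array([[2.0, 1.0], [1.0, 2.0]])        # prefers agreement

# messages[(a, b)][x_b] = message from a to b about b's value
messages = {(a, b): np.ones(2) for a, b in edges + [(b, a) for a, b in edges]}

def neighbors(v):
    return [a for a, b in messages if b == v]

for _ in range(20):                                   # fixed number of sweeps
    for (a, b) in list(messages):
        # Product of a's unary potential and all incoming messages except b's.
        prod = unary[a].copy()
        for c in neighbors(a):
            if c != b:
                prod *= messages[(c, a)]
        msg = pairwise.T @ prod                       # sum over a's values
        messages[(a, b)] = msg / msg.sum()            # normalize for stability

beliefs = {}
for v in unary:
    b = unary[v].copy()
    for c in neighbors(v):
        b *= messages[(c, v)]
    beliefs[v] = b / b.sum()
print(beliefs)   # Y1's bias toward value 1 propagates to Y2 and Y3
```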
General CRFs: Unrestricted model structure
• Inference: compute properties of the posterior distribution.
[Figure: per-word marginal tag distributions for "The cat sat on the mat .": The → DT .9, NN .05, ...; cat → NN .8, JJ .1, ...; sat → VBD .7, VB .1, ...; on → IN .9, NN .01, ...; the → DT .9, NN .05, ...; mat → NN .4, JJ .3, ...; . → . .99, , .001, ...]
15
General CRFs: Unrestricted model structure
• Decoding: coming up with predictions from the results of inference.
[Figure: the decoded tag sequence DT NN VBD IN DT NN . for "The cat sat on the mat ."]
16
General CRFs: Unrestricted model structure
• One uses CRFs with several approximations:
  - Approximate inference.
  - Approximate decoding.
  - Mis-specified model structure.
  - MAP training (vs. Bayesian).
  (Some of these could be present in linear-chain CRFs as well.)
• Why are we still maximizing data likelihood? Our system is more like a Bayes-inspired neural network that makes predictions.
17
Train directly to minimize task loss
(Stoyanov, Ropson, & Eisner 2011; Stoyanov & Eisner 2012)
[Figure: x → (approximate) inference → p(y|x) → (approximate) decoding → ŷ → L(y*, ŷ); the whole pipeline is a black-box decision function parameterized by θ.]
• Adjust θ to (locally) minimize training loss
  - E.g., via back-propagation (+ annealing)
"Empirical Risk Minimization under Approximations (ERMA)"
18
Optimization Criteria

                           Loss Aware: No    Loss Aware: Yes
Approximation Aware: No    MLE               SVMstruct [Finley and Joachims, 2008],
                                             M3N [Taskar et al., 2003],
                                             Softmax-margin [Gimpel & Smith, 2010]
Approximation Aware: Yes                     ERMA

22
Experimental Results
• 3 NLP problems; also synthetic data
• We show that:
  - General CRFs work better when they match dependencies in the data.
  - Minimum risk training results in more accurate models.
• ERMA software package available at www.clsp.jhu.edu/~ves/software
23
ERMA software package
http://www.clsp.jhu.edu/~ves/software
• Includes syntax for describing general CRFs.
• Supports sum-product and max-product BP.
• Can optimize several commonly used loss functions: MSE, Accuracy, F-score (see the sketch below).
• The package is generic:
  - Little effort to model new problems.
  - About 1-3 days to express each problem in our formalism.
24
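As a purely illustrative companion to the loss functions listed above, the following sketch shows differentiable surrogates (MSE, expected accuracy, soft F-score) computed from marginal beliefs; the ERMA package's actual implementations may differ.

```python
# Differentiable surrogate losses as functions of marginal beliefs
# b[i] = p(y_i = 1 | x), so gradients can flow back into inference.
import numpy as np

def mse_loss(beliefs, gold):
    """Mean squared error between beliefs and 0/1 gold labels."""
    return np.mean((beliefs - gold) ** 2)

def expected_accuracy_loss(beliefs, gold):
    """1 - expected accuracy: a smooth surrogate for accuracy."""
    p_correct = np.where(gold == 1, beliefs, 1.0 - beliefs)
    return 1.0 - np.mean(p_correct)

def soft_f1_loss(beliefs, gold):
    """1 - soft F1, using expected true positives and expected predictions."""
    tp = np.sum(beliefs * gold)              # expected true positives
    pred = np.sum(beliefs)                   # expected number predicted positive
    prec = tp / (pred + 1e-9)
    rec = tp / (np.sum(gold) + 1e-9)
    return 1.0 - 2 * prec * rec / (prec + rec + 1e-9)

beliefs = np.array([0.9, 0.2, 0.7, 0.4])
gold = np.array([1, 0, 1, 1])
print(mse_loss(beliefs, gold), expected_accuracy_loss(beliefs, gold), soft_f1_loss(beliefs, gold))
```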
Modeling Congressional Votes
The ConVote corpus [Thomas et al., 2006]
[Example: a speech that commends "the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill..." is followed by a Yea vote; a second speech ("Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve ...") is also followed by a Yea vote.]
• Predict representatives' votes based on debates.
[Figure: each representative's Y/N vote variable is connected to factors over the text of his or her speeches, and "context" factors link the vote variables of representatives who refer to one another.]
31
Modeling Congressional Votes

                                                              Accuracy
Non-loopy baseline (2 SVMs + min-cut)                         71.2
Loopy CRF models (inference via loopy sum-product BP):
  Maximum-likelihood training (with approximate inference)    78.2
  Softmax-margin (loss-aware)                                 79.0
  ERMA (loss- and approximation-aware)                        84.5

(Boldfaced results are significantly better than all others, p < 0.05.)
36
Information Extraction from Semi-Structured Text
[Example from the CMU Seminar Announcement Corpus [Freitag, 2000], with gold field labels in brackets:
  What: Special Seminar
  Who: Prof. Klaus Sutner [speaker]
  Computer Science Department, Stevens Institute of Technology
  Topic: "Teaching Automata Theory by Computer"
  Date: 12-Nov-93
  Time: 12:00 pm [start time]
  Place: WeH 4623 [location]
  Host: Dana Scott (Asst: Rebecca Clark x8-6737)
  ABSTRACT: We will demonstrate the system "automata" that implements finite state machines...
  ...
  After the lecture, Prof. Sutner [speaker] will be glad to demonstrate and discuss the use of MathLink and his "automata" package]
38
Skip-Chain CRF for Info Extraction
• Extract speaker, location, stime, and etime from seminar announcement emails
[Figure: a linear-chain CRF over the tokens "Who: Prof. Klaus Sutner ..." (labels O S S S) and, later in the message, "... Prof. Sutner will ..." (labels S S O); a skip edge connects the repeated occurrences of the same name so that they are encouraged to receive the same label.]
CMU Seminar Announcement Corpus [Freitag, 2000]
Skip-chain CRF [Sutton and McCallum, 2005; Finkel et al., 2005]
39
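A minimal sketch of how skip-chain structure can be built, assuming a simple heuristic (connect repeated capitalized tokens); the actual feature set and edge-selection rules in the cited work are richer.

```python
# Build chain edges between adjacent tokens plus "skip" edges between distant
# occurrences of the same capitalized word, so BP can share evidence between
# them, as in the Prof. Sutner example above.
def skip_chain_edges(tokens):
    edges = [(i, i + 1) for i in range(len(tokens) - 1)]        # linear chain
    seen = {}                                                   # word -> first position
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():                                   # candidate name token
            if tok in seen and i - seen[tok] > 1:
                edges.append((seen[tok], i))                    # skip edge
            seen.setdefault(tok, i)
    return edges

tokens = "Who : Prof. Klaus Sutner ... Prof. Sutner will".split()
print(skip_chain_edges(tokens))
# The extra edges connect the two "Prof." tokens and the two "Sutner" tokens.
```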
Semi-Structured Information Extraction

                                                              F1
Non-loopy baseline (linear-chain CRF)                         86.2
Non-loopy baseline + ERMA (trained for loss, not likelihood)  87.1
Loopy CRF models (inference via loopy sum-product BP):
  Maximum-likelihood training (with approximate inference)    89.5
  Softmax-margin (loss-aware)                                 90.2
  ERMA (loss- and approximation-aware)                        90.9

(Boldfaced results are significantly better than all others, p < 0.05.)
43
Collective Multi-Label Classification
[Example from Reuters Corpus Version 2 [Lewis et al., 2004]: "The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later. ..." Candidate labels: Oil, Libya, Sports. Correlated labels such as Oil and Libya tend to co-occur, which a collective model can exploit.]
[Ghamrawi and McCallum, 2005; Finley and Joachims, 2008]
47
Multi-Label Classification

                                                              F1
Non-loopy baseline (logistic regression for each label)       81.6
Loopy CRF models (inference via loopy sum-product BP):
  Maximum-likelihood training (with approximate inference)    84.0
  Softmax-margin (loss-aware)                                 83.8
  ERMA (loss- and approximation-aware)                        84.6

(Boldfaced results are significantly better than all others, p < 0.05.)
51
Summary

                          Congressional Vote     Semi-str. Inf.     Multi-label
                          Modeling (Accuracy)    Extraction (F1)    Classification (F1)
Non-loopy baseline        71.2                   87.1               81.6
Loopy CRF models:
  Maximum-likelihood      78.2                   89.5               84.0
  ERMA                    84.5                   90.9               84.6
52
Synthetic Data
• Generate a CRF at random
  - Random structure & parameters
• Use Gibbs sampling to generate data
• Forget the parameters
  - Optionally add noise to the structure
• Learn the parameters from the sampled data
• Evaluate using one of four loss functions
• Total of 12 models of different size/connectivity
53
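A minimal sketch of the synthetic-data recipe above, with illustrative sizes: draw a random pairwise binary MRF (random structure & parameters), then generate samples from it by Gibbs sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Random structure: symmetric weight matrix with roughly 30% of edges present.
mask = rng.random((n, n)) < 0.3
W = rng.normal(scale=1.0, size=(n, n)) * mask
W = np.triu(W, 1)
W = W + W.T                                          # symmetric, zero diagonal
b = rng.normal(scale=0.5, size=n)                    # unary parameters

def gibbs_sample(W, b, sweeps=100):
    y = rng.integers(0, 2, size=len(b))
    for _ in range(sweeps):
        for i in range(len(b)):
            logit = b[i] + W[i] @ y                  # conditional log-odds of y_i = 1
            p = 1.0 / (1.0 + np.exp(-logit))
            y[i] = rng.random() < p
    return y.copy()

data = np.array([gibbs_sample(W, b) for _ in range(200)])
print(data.mean(axis=0))                             # empirical marginals of the samples
```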
Synthetic Data: Results

Δ loss compared to the true model, by test loss (columns) and training objective (rows):

Test loss:                       MSE       Accuracy    F-Score    ApprLogL
Trained on ApprLogL (baseline)   .71       .75         1.17       -.31
Trained on the matching loss     .05       .01         .08
wins/ties/losses (over 12 models) 12/0/0   11/0/1      10/2/0
54
Introducing Structure Mismatch
[Figure: test loss versus degree of structure mismatch (10%-40%) for four train--test loss pairings: ALogL -- MSE, MSE -- MSE, ALogL -- F-score, F-score -- F-score.]
55
Back-Propagation of Error for Empirical Risk Minimization
(independently done by Domke 2010, 2011)
• Back-propagation of error (automatic differentiation in the reverse mode) to compute gradients of the loss with respect to θ.
• A gradient-based local optimization method to find the θ* that (locally) minimizes the training loss.
[Figure: the black-box decision function mapping x to ŷ and then to the loss L(y*, ŷ), parameterized by θ, could be a neural network; equally, it could be the CRF system itself (a factor graph over Y1-Y4 and X1-X3), since both are just differentiable computations.]
60
Error Back-Propagation
[Figure (animation): the loopy BP computation is unrolled into a circuit. A belief such as P(VoteReidbill77 = Yea | x) is computed from messages, e.g. m(y1→y2) = m(y3→y1) · m(y4→y1), and the messages ultimately depend on the parameters θ; back-propagation runs through this circuit in reverse.]
70
Error Back-Propagation
• Applying the differentiation chain rule over and over.
• Forward pass:
  - Regular computation (inference + decoding) in the model (+ remember intermediate quantities).
• Backward pass:
  - Replay the forward pass in reverse, computing gradients.
71
The Forward Pass
• Run inference and decoding:
  θ → Inference (loopy BP) → messages → beliefs → Decoding → output → Loss → L
72
The Backward Pass
• Replay the computation backward, calculating gradients:
  θ → Inference (loopy BP) → messages → beliefs → Decoding → output → Loss → L
  ð(θ) ← ð(messages) ← ð(beliefs) ← ð(output) ← ð(L) = 1
  where ð(f) = ∂L/∂f for each intermediate quantity f.
73
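A minimal sketch of the taped forward pass and reversed backward pass on a toy computation shaped like one BP belief (message product, normalization, loss); it is meant only to make the adjoint bookkeeping concrete, not to reproduce the ERMA implementation.

```python
# Toy pipeline: b = normalize(m3 * m4) with messages m_i = exp(theta_i);
# loss is squared error to a target belief.  The forward pass stores
# intermediates on a "tape"; the backward pass replays them in reverse,
# accumulating adjoints d(f) = dL/df.
import numpy as np

def forward(theta, target):
    m3, m4 = np.exp(theta[0]), np.exp(theta[1])      # "messages"
    u = m3 * m4                                      # unnormalized belief
    Z = u.sum()
    b = u / Z                                        # normalized belief
    L = np.sum((b - target) ** 2)                    # loss
    return L, (m3, m4, u, Z, b)

def backward(theta, target, tape):
    m3, m4, u, Z, b = tape
    d_b = 2 * (b - target)                           # dL/db
    d_u = d_b / Z - np.dot(d_b, u) / Z**2            # through the normalization
    d_m3, d_m4 = d_u * m4, d_u * m3                  # through the product
    return np.stack([d_m3 * m3, d_m4 * m4])          # through exp: dL/dtheta

theta = np.array([[0.2, -0.1], [0.5, 0.3]])          # two 2-valued "messages"
target = np.array([0.9, 0.1])
L, tape = forward(theta, target)
print(L, backward(theta, target, tape))
```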
Gradient-Based Optimization
• Use a local optimizer to find the θ* that minimizes training loss.
• In practice, we use a second-order method, Stochastic Meta-Descent (Schraudolph 1999).
  - Some more automatic-differentiation magic is needed to compute vector-Hessian products (Pearlmutter 1994).
• Both the gradient and the vector-Hessian computation have the same complexity as the forward pass (small constant factor).
74
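A minimal sketch (not the actual SMD code) of one cheap way to get the needed vector-Hessian products: the finite-difference-of-gradients approximation Hv ≈ (∇L(θ + εv) − ∇L(θ)) / ε; Pearlmutter (1994) gives an exact method of the same cost via forward-over-reverse differentiation.

```python
import numpy as np

def loss_grad(theta):
    """Toy stand-in for the gradient returned by back-propagation:
    L(theta) = 0.5 * theta^T A theta for a fixed PSD matrix A."""
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ theta

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    # Approximate H v with one extra gradient evaluation.
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

theta = np.array([0.4, -1.2])
v = np.array([1.0, 0.0])
print(hessian_vector_product(loss_grad, theta, v))   # ≈ first column of A
```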
Deterministic Annealing
• Some loss functions are not differentiable (e.g., accuracy).
• Some inference methods are not differentiable (e.g., max-product BP).
• Replace max with softmax and anneal.
75
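A minimal sketch of the softmax-and-anneal idea: a temperature-controlled softmax is smooth, so gradients exist, and it approaches the hard max as the temperature is annealed toward zero.

```python
import numpy as np

def softmax(scores, T):
    z = (scores - scores.max()) / T          # stabilized
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 1.5])
for T in [10.0, 1.0, 0.1, 0.01]:
    p = softmax(scores, T)
    soft_max_value = np.dot(p, scores)       # smooth approximation of max(scores)
    print(f"T={T:>5}: weights={np.round(p, 3)}, softmax value={soft_max_value:.3f}")
# As T shrinks, the weights concentrate on the argmax and the value tends to 2.0.
```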
Part 1: Your favorite approximate inference algorithm is a trainable hack
Part 2: What other trainable inference devices can we devise?
[Figure: the prediction device (suitable for domain), which we would preferably be able to tune for the speed-accuracy tradeoff (Horvitz 1989, "flexible computation").]
1. Lookup Methods
• Hash tables
• Memory-based learning
• Dual-path models (look up if possible, else do deeper inference; see the sketch below)
  - In general, dynamic mixtures of policies: Halpern & Pass 2010
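A minimal sketch of the dual-path idea, with placeholder components: answer from a lookup table when possible, otherwise fall back to a slow inference routine and memorize the result.

```python
def slow_inference(sentence: str) -> str:
    # Stand-in for expensive structured inference (e.g., running BP to convergence).
    return max(sentence.split(), key=len)          # "predict" the longest word

class DualPathPredictor:
    def __init__(self):
        self.lookup = {}                           # memorized input -> output pairs

    def predict(self, sentence: str) -> str:
        if sentence in self.lookup:                # fast path
            return self.lookup[sentence]
        answer = slow_inference(sentence)          # slow path
        self.lookup[sentence] = answer             # memorize for next time
        return answer

p = DualPathPredictor()
print(p.predict("the cat sat on the mat"))         # slow path
print(p.predict("the cat sat on the mat"))         # fast path (cached)
```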
2. Choose Fast Model Structure
• Static choice of fast model structure (Sebastiani & Ramoni 1998)
  - Learning a low-treewidth model (e.g., Bach & Jordan 2001; Narasimhan & Bilmes 2004)
  - Learning a sparse model (e.g., Lee et al. 2007)
  - Learning an approximate arithmetic circuit (Lowd & Domingos 2010)
• Dynamic choice of fast model structure
  - Dynamic feature selection (Dulac-Arnold et al., 2011; Busa-Fekete et al., 2012; He et al., 2012; Stoyanov & Eisner, 2012)
  - Evidence-specific tree (Chechetka & Guestrin 2010)
  - Data-dependent convex optimization problem (Domke 2008, 2012)
3. Pruning Unlikely Hypotheses
• Tune the aggressiveness of pruning (see the beam sketch below)
  - Pipelines, cascades, beam-width selection
  - Classifiers or competitive thresholds
  - E.g., Taskar & Weiss 2010; Bodenstab et al. 2011
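A minimal sketch of pruning whose aggressiveness is a tunable knob: a generic beam decoder where beam_width is the speed-accuracy parameter a trained policy could set; the scoring function is a placeholder.

```python
TAGS = ["DT", "NN", "VB"]

def local_score(prev_tag, tag, word):
    # Illustrative stand-in for factor/transition scores from a trained model.
    bonus = 1.0 if (tag == "DT" and word == "the") else 0.0
    return bonus - 0.1 * (prev_tag == tag)

def beam_decode(words, beam_width):
    beam = [((), 0.0)]                                  # (partial tag sequence, score)
    for word in words:
        candidates = []
        for seq, score in beam:
            prev = seq[-1] if seq else "<s>"
            for tag in TAGS:
                candidates.append((seq + (tag,), score + local_score(prev, tag, word)))
        # Prune: keep only the best `beam_width` hypotheses.
        beam = sorted(candidates, key=lambda x: -x[1])[:beam_width]
    return beam[0]

print(beam_decode("the cat sat".split(), beam_width=1))    # fast, greedy
print(beam_decode("the cat sat".split(), beam_width=27))   # wide enough to be exhaustive here
```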
4. Pruning Work During Search
• Early stopping
  - Message-passing inference (Stoyanov et al. 2011)
ERMA: Increasing Speed by Early Stopping (synthetic data)
[Figure: test loss versus the maximum number of BP iterations (100, 30, 20, 10) for four train--test loss pairings: ALogL -- MSE, MSE -- MSE, ALogL -- F-score, F-score -- F-score.]
81
4. Pruning Work During Search
• Early stopping before convergence
  - Message-passing inference (Stoyanov et al. 2011)
  - Agenda-based dynamic programming (Jiang et al. 2012): approximate A*!
• Update some messages more often
  - In generalized BP, some messages are more complex
  - The order of messages also affects the convergence rate
    - Cf. residual BP
    - Cf. flexible arithmetic circuit computation (Filardo & Eisner 2012)
• Coarsen or drop messages selectively
  - Value of computation
  - Cf. expectation propagation (for likelihood only)
5. Sequential Decisions with Revision
• Common to use sequential decision processes for structured prediction
  - MaltParser, SEARN, etc.
Algorithm example (from Joakim Nivre)
[Figure: a transition-based dependency parser incrementally parses "Economic news had little effect on financial markets ." (tags JJ NN VBD JJ NN IN JJ NNS .) using SHIFT, REDUCE, LEFT-ARC and RIGHT-ARC actions with labels such as NMOD, SBJ, OBJ, PMOD, and P.]
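A minimal sketch (illustrative, not MaltParser) of the kind of sequential decision process shown in the figure: an arc-standard shift-reduce parser whose next action comes from a policy, here a toy heuristic rather than a trained classifier that could also backtrack and revise.

```python
def arc_standard_parse(words, policy):
    stack, buffer, arcs = [], list(range(len(words))), []
    while buffer or len(stack) > 1:
        action = policy(stack, buffer, words)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) >= 2:
            dep = stack.pop(-2)                 # second-from-top depends on top
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            dep = stack.pop()                   # top depends on second-from-top
            arcs.append((stack[-1], dep))
        else:                                   # no legal move requested: stop
            break
    return arcs                                 # list of (head index, dependent index)

def toy_policy(stack, buffer, words):
    # Shift until the buffer is empty, then attach everything rightward.
    return "SHIFT" if buffer else "RIGHT-ARC"

words = "Economic news had little effect".split()
print(arc_standard_parse(words, toy_policy))
```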
5. Sequential Decisions with Revision
• Common to use sequential decision processes for structured prediction
  - MaltParser, SEARN, etc.
  - Often treated as reinforcement learning
    - Cumulative or delayed reward
    - Try to avoid "contagious" mistakes
• New opportunity:
  - An enhanced agent that can backtrack and fix errors
    - The flip side of RL lookahead! (only in a forgiving environment)
    - Sometimes we can observe such agents (in the psych lab)
  - Or widen its beam and explore in parallel
Open Questions
• An effective algorithm that dynamically assesses the value of computation.
• Theorems of the following form: if the true model comes from distribution P, then with high probability there exists a fast/accurate policy in the policy space. (Better yet, find the policy!)
• Effective policy learning methods.
On Policy Learning Methods ...
• Basically large reinforcement learning problems
  - But rather strange ones! (Eisner & Daumé 2011)
• Search in policy parameter space
  - Policy gradient (doesn't work)
  - Direct search (e.g., Nelder-Mead; see the sketch below)
• Search in priority space
  - Policy (→ priorities) → trajectory → reward
  - Often, many equivalent trajectories will get the same answer
  - Need a surrogate objective, like A*
• Search in trajectory space
  - SEARN (too slow for some controllers)
  - Loss-augmented inference (Chiang et al. 2009; McAllester et al. 2010)
  - Response surface methodology (really searches in policy space)
  - Integer linear programming
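A minimal sketch of the "direct search in policy parameter space" option, assuming scipy is available: tune two hypothetical policy parameters (a beam width and a stopping threshold) with Nelder-Mead against a synthetic stand-in for loss + runtime.

```python
import numpy as np
from scipy.optimize import minimize

def empirical_risk_plus_runtime(policy_params):
    beam, threshold = policy_params
    # Stand-in objective: bigger beams / looser thresholds reduce loss but cost time.
    loss = 1.0 / (1.0 + abs(beam)) + 0.5 * threshold**2
    runtime = 0.01 * abs(beam)
    return loss + runtime

result = minimize(empirical_risk_plus_runtime, x0=np.array([1.0, 1.0]),
                  method="Nelder-Mead")
print(result.x, result.fun)
```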
Part 1: Your favorite approximate inference algorithm is a trainable hack
Part 2: What other trainable inference devices can we devise?
Part 3: Beyond ERMA to IRMA
(Empirical Risk Minimization under Approximations → Imputed Risk Minimization under Approximations)
Where does p(x, y) come from?
[Figure: as in the Bayesian decision theory diagram, the loss, the data distribution, and the prediction rule together determine the optimized parameters of the prediction rule; the new question is where the data distribution itself comes from.]
Generative vs. discriminative
• training data vs. dev data (Raina et al. 2003)
• unsupervised vs. supervised data (McCallum et al. 2006)
• regularization vs. empirical loss (Lasserre et al. 2006)
• data distribution vs. decision rule (this work; cf. Lacoste-Julien 2011)
[Figure: the same diagram, with the data-distribution side labeled "science" and the optimized-parameters side labeled "engineering".]
Data imputation (Little & Rubin 1987)
• May need to "complete" missing data
  - What are we given?
  - How do we need to complete it?
  - How do we complete it?
[Figure: the same science/engineering diagram; imputation supplies the completed data from which the prediction rule's parameters are optimized.]
1. Have plenty of inputs; impute outputs
• "Model compression / uptraining / structure compilation" (see the sketch after this list)
  - GMM → VQ (Hedelin & Skoglund 2000)
  - ensemble → single classifier (Bucila et al. 2006)
  - sparse coding → regression or NN (Kavukcuoglu et al., 2008; Jarrett et al., 2009; Gregor & LeCun, 2010)
  - CRF or PCFG → local classifiers (Liang, Daumé & Klein 2008)
  - latent-variable PCFG → deterministic sequential parser (Petrov et al. 2010)
  - sampling instead of 1-best
  - [stochastic] local search → regression (Boyan & Moore 2000)
  - k-step planning in an MDP → classification or k'-step planning (e.g., rollout in Bertsekas 2005; Ross et al. 2011, DAgger)
  - BN → arithmetic circuit (Lowd & Domingos, 2010)
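A minimal sketch of the compression/uptraining recipe, with placeholder models and data (scikit-learn assumed to be available): a slow teacher imputes outputs on plentiful unlabeled inputs, and a fast student is trained on the imputed pairs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier   # slow, accurate "teacher"
from sklearn.linear_model import LogisticRegression   # fast "student"

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(200, 10))
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)
X_unlabeled = rng.normal(size=(5000, 10))              # plenty of inputs

teacher = RandomForestClassifier(n_estimators=100).fit(X_labeled, y_labeled)
y_imputed = teacher.predict(X_unlabeled)               # impute outputs

student = LogisticRegression(max_iter=1000).fit(X_unlabeled, y_imputed)
print("student agrees with teacher on",
      (student.predict(X_unlabeled) == y_imputed).mean(), "of imputed data")
```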
2. Have good outputs; impute inputs
• Distort inputs from input-output pairs
  - Abu-Mostafa 1995
  - SVMs can be regarded as doing this too!
• Structured prediction: impute possible missing inputs
  - Impute many Chinese sentences that should translate into each observed English sentence (Li et al., 2011)
3. Insufficient training data to impute well
• We assumed that we have a good slow model at training time
• But what if we don't?
• Could sample from the posterior over model parameters as well ...
4. Statistical Relational Learning
• May only have one big incomplete training example!
  - Sample jointly from (model parameters, completions of the data)
  - Need a censorship model to mask data plausibly
  - Need a distribution over queries as well: the query is part of the (x, y) pair
• What model should we use here?
  - Start with a "base MRF" to allow domain-specific inductive bias
  - But try to respect the marginals we can get good estimates of
    (some variables rarely observed; some values rarely observed)
  - Want IRMA → ERMA as we get more and more training data
  - Need a high-capacity model to get consistency
    (learn an MRF close to the base MRF? use a GP based on the base MRF?)
Summary: Speed-Accuracy Tradeoffs
• Inference requires lots of computation
• Is some computation going to waste?
  - Sometimes the best prediction is overdetermined
  - Quick ad hoc methods sometimes work: how to respond?
• Is some computation actively harmful?
  - In approximate inference, passing a message can hurt
  - Frustrating to simplify the model just to fix this
• Want to keep improving our models!
  - But we need good, fast approximate inference
• Choose approximations automatically
  - Tuned to the data distribution & loss function
  - "Trainable hacks" are more robust
Summary: Bayesian Decision Theory
[Figure: the loss function, the data distribution, and the prediction rule jointly determine the optimized parameters of the prediction rule.]
• What prediction rule? (approximate inference + beyond)
• What loss function? (can include runtime)
• How to optimize? (backprop, RL, ...)
• What data distribution? (may have to impute)
FIN
Current Collaborators
• Undergrads & junior grad students: Katherine Wu, Jay Feldman, Frank Ferraro, Tim Vieira, Adam Teichert, Michael Paul
• Mid to senior grad students: Matt Gormley, Nick Andrews, Henry Pao, Wes Filardo, Jason Smith, Ariya Rastrow
• Faculty: Ves Stoyanov, Ben Van Durme, Mark Dredze, Yanif Ahmad, Hal Daumé, René Vidal (plus several of their students)
NLP Tasks
15-20 years of introducing new formalisms, models & algorithms across NLP
• Parsing
  - Dependency, constituency, categorial, ...
  - Deep syntax
  - Grammar induction
• Word-internal modeling
  - Morphology
  - Phonology
  - Transliteration
  - Named entities
• Translation
  - Syntax-based (synchronous, quasi-synchronous, training, decoding)
• Miscellaneous
  - Tagging, sentiment, text cat, topics, coreference, web scraping ...
  - Generic algorithms on automata, hypergraphs, graphical models
Current Guiding Themes
Machine learning + linguistic structure.
Fashion statistical models that capture good intuitions about various kinds of linguistic structure. Develop efficient algorithms to apply these models to data. Be generic.
1. Principled Bayesian models of various interesting NLP domains.
   - Discover underlying structure with little supervision
   - Requires new learning and inference algorithms
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI.
   - A declarative programming language that supports modularity
   - Backed by a searchable space of strategies & data structures
Fast but Principled Reasoning to Analyze Data
• Principled:
  - New models suited to the data
  - + new inference algorithms for those models
  - = draw appropriate conclusions from data
• Fast prediction:
  - Inference algorithms
  - + approximations trained to balance speed & accuracy
  - = 80% of the benefit at 20% of the cost
• Reusable frameworks for modeling & prediction
Word-Internal Modeling
Variation in a name within and across languages
• E step: re-estimate the distribution over all spanning trees
  - Requires: a corpus model with sequential generation, copying, mutation
• M step: re-estimate the name mutation model along likely tree edges
  - Requires: a trainable parametric model of name mutation
Word-Internal Modeling
Spelling of named entities
• The "gazetteer problem" in NER systems
  - Using gazetteer features helps performance on in-gazetteer names.
  - But it hurts performance on out-of-gazetteer names!
  - Spelling features essentially do not learn from the in-gazetteer names.
• Solution: Generate your gazetteer
  - Treat the gazetteer itself as training data for a generative model of entity names.
    - Includes spelling features.
    - A non-parametric model generates good results.
  - Include this sub-model within a full NER model.
    - Not obvious how, especially for a discriminative NER model.
  - Can exploit additional gazetteer data, such as town population.
• The problem & solution extend to other dictionary resources in NLP
  - Acronyms, IM-speak, cognate translations, ...
Word-Internal Modeling
Inference over multiple strings
• 2011 dissertation by Markus Dreyer
  - Organize corpus tokens into morphological paradigms
  - Infer missing forms
String and sequence modeling
Optimal inference of strings
• 2011 dissertation by Markus Dreyer
  - Organize corpus types into morphological paradigms
  - Infer missing forms
  - Cool model, but exact inference is intractable, even undecidable
• Dual decomposition to the rescue?
  - Will allow MAP inference in such models
  - Wasn't obvious how to infer strings by dual decomposition
  - Message passing algorithm: if it converges, the answer is guaranteed correct
  - We have one technique and are working on others
  - So far, we've applied it to intersecting many automata
    - E.g., exact consensus output of ASR or MT systems
    - Usually converges reasonably quickly
[Figure: runtime is O(100 * n * g) per iteration.]
Grammar Induction
• Finding the "best" grammar is a horrible optimization problem
  - Even for overly simple definitions of "best"
• Two new attacks:
  - Mathematical programming techniques
    - Branch and bound
    - + Dantzig-Wolfe decomposition over the sentences
    - + Stochastic local search
  - Deep learning
    - "Inside" and "outside" strings should depend on each other only through a nonterminal (context-freeness)
    - CCA should be able to find that nonterminal (spectral learning)
    - But first we need vector representations of inside and outside strings
    - So use CCA to build up representations recursively (deep learning)
Improved Topic Models
Results improve on the state of the art
• What can we learn from distributional properties of words?
• Some words group together into "topics."
  - They tend to cooccur in documents, or have similar syntactic arguments.
• But are there further hidden variables governing this?
  - Try to get closer to the underlying meaning or discourse space.
• Future: embed words or phonemes in a structured feature space whose structure must be learned
Applied NLP Tasks
Results improve on the state of the art
• Add more global features to the model ...
  - Need approximate inference, but it's worth it
  - Especially if we train for the approximate-inference condition
• Within-document coreference
  - Build up properties of the underlying entities
  - Gender, number, animacy, semantic type, head word
• Sentiment polarity
  - Exploit cross-document references that signal (dis)agreement between two authors
• Multi-label text categorization
  - Exploit correlations between labels on the same document
• Information extraction
  - Exploit correlations between labels on faraway words
Database-generated websites
Database back-end:
  Post ID | Author        Author | Title                Author | Location
  520     | Demon         Demon  | Moderator            Demon  | Pennsylvania
  521     | Ushi          Ushi   | Pink Space Monkey    Ushi   | Where else?
[Figure: web-page code (...) produced by querying the DB.]
112
Website-generated databases*
Recovered database:
  Post ID | Author        Author | Title                Author | Location
  520     | Demon         Demon  | Moderator            Demon  | Pennsylvania
  521     | Ushi          Ushi   | Pink Space Monkey    Ushi   | Where else?
* Thanks, Bayes! Given web pages, we state a prior over annotated grammars, a prior over database schemas, and a prior over database contents.
113
Relational database → Webpages
• Why isn't this easy?
  - Could write a custom script ...
  - ... for every website in every language?? (and maintain it??)
• Why are database-backed websites important?
  1. Vast amounts of useful information are published this way! (most?)
  2. In 2007, the Dark Web project @ U. Arizona estimated 50,000 extremist/terrorist websites; the fastest growth was in Web 2.0 sites
     - Some were transient sites, or subcommunities on larger sites
  3. Our techniques could extend to analyze other semi-structured docs
• Why are NLP methods relevant?
  - Like NL, these webpages are meant to be read by humans
  - But they're a mixture of NL text, tables, semi-structured data, repetitive formatting ...
  - Harvest NL text + direct facts (including background facts for NLP)
  - Helpful that HTML is a tree: we know about those
114
[Screenshot slides 115-127: examples of database-backed web pages]
• Shopping & auctions (with user comments)
• News articles & blogs ... with user comments
• Crime listings
• Social networking
• Collaborative encyclopedias
• Linguistic resources (monolingual, bilingual)
• Classified ads
• Catalogs
• Public records (in some countries): real estate, car ownership, sex offenders, legal judgments, inmate data, death records, census data, map data, genealogy, elected officials, licensed professionals ... (http://www.publicrecordcenter.com)
• Directories of organizations (e.g., Yellow Pages), e.g. Banks of the World >> South Africa >> Union Bank of Nigeria PLC
• Directories of people
127
Different types of structured fields
[Screenshot annotations: explicit fields, fields with internal structure, iterated fields.]
128
Forums, bulletin boards, etc.
129
Lots of structured & unstructured content
[Screenshot annotations: author, date of post, title (moderator, member, ...), post, geographic location of poster.]
130
Fast but Principled Reasoning to Analyze Data
• Principled:
  - New models suited to the data
  - + new inference algorithms for those models
  - = draw appropriate conclusions from data
• Fast prediction:
  - Inference algorithms
  - + approximations trained to balance speed & accuracy
  - = 80% of the benefit at 20% of the cost
• Reusable frameworks for modeling & prediction
ERMA
Empirical Risk Minimization under Approximations
• Our pretty models are approximate
  - Our inference procedures are also approximate
  - Our decoding procedures are also approximate
  - Our training procedures are also approximate (non-Bayesian)
• So let's train to minimize loss in the presence of all these approximations
  - Striking improvements on several real NLP tasks (as well as a range of synthetic data)
Speed-Aware ERMA
Empirical Risk Minimization under Approximations
• So let's train to minimize loss in the presence of all these approximations
  - Striking improvements on several real NLP tasks (as well as a range of synthetic data)
• Even better, let's train to minimize loss + runtime
  - Need special parameters to control the degree of approximation
    - How long to run? Which messages to pass? Which features to use?
  - Get substantial speedups at little cost to accuracy
• Next extension: probabilistic relational models
  - Learn to do fast approximate probabilistic reasoning about slots and fillers in a knowledge base
  - Detect interesting facts, answer queries, improve info extraction
  - Generate plausible supervised training data: minimize imputed risk
Learned Dynamic Prioritization
More minimization of loss + runtime
• Many inference procedures take nondeterministic steps that refine current beliefs.
  - Graphical models: Which message to update next?
  - Parsing: Which constituent to extend next?
  - Parsing: Which build action, or should we backtrack & revise?
  - Should we prune, or narrow or widen the beam?
  - Coreference: Which clusters to merge next?
• Learn a fast policy that decides what step to take next (see the sketch below).
  - "Compile" a slow inference procedure into a fast one that is tailored to the specific data distribution and task loss.
• Hard training problem in order to make test time fast.
  - We're trying a bunch of different techniques.
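A minimal sketch of the prioritization loop, with a hand-coded stand-in for the learned priority function: at each step, perform the pending update judged most valuable, and stop early once no update looks worth its cost.

```python
def priority(update_id, beliefs):
    # Stand-in for a learned value-of-computation estimate.
    return abs(beliefs.get(update_id, 0.0) - 0.5)

def run_inference(initial_beliefs, max_steps=10, stop_below=0.05):
    beliefs = dict(initial_beliefs)
    for _ in range(max_steps):
        # Pick the pending update the policy considers most valuable.
        best = max(beliefs, key=lambda u: priority(u, beliefs))
        if priority(best, beliefs) < stop_below:
            break                                          # policy says: stop early
        beliefs[best] = 0.5 + 0.5 * (beliefs[best] - 0.5)  # toy refinement toward 0.5
    return beliefs

print(run_inference({"msg_a": 0.9, "msg_b": 0.52, "msg_c": 0.1}))
```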
Compressed Learning
Sublinear time
• How do we do unsupervised learning on many terabytes of data??
  - Can't afford to do many passes over the dataset ...
• Throw away some data?
  - Might create bias. How do we know we're not throwing away the important clues?
• Better: summarize the less relevant data and try to learn from the summary.
  - Google N-gram corpus = a compressed version of the English web.
  - N-gram counts from 1 trillion words of text
135
Tagging isolated N-grams
[Figure: the 5-gram "though most monitor lizards from" admits two taggings, IN DT NN NNS IN vs. IN NN VB NNS IN, corresponding to different topics (Biology vs. Computers).]
Oops, ambiguous. For learning, it would help to have the whole sentence.
136
Tagging N-grams in context
[Figure: the same 5-gram in two full-sentence contexts: "... some will eat vegetables though most monitor lizards from Africa are carnivores ..." (tagged IN DT NN NNS IN, topic Biology) and "... he watches them up close though most monitor lizards from a distance ..." (tagged IN NN VB NNS IN, topic Computers).]
Oops, ambiguous. For learning, it would help to have the whole sentence.
138
Extrapolating contexts ...
[Figure: the N-gram "though most monitor lizards from" is linked to overlapping N-grams that extend it to the left (e.g., "vegetables though most monitor lizards", "close though most monitor lizards") and to the right (e.g., "most monitor lizards from Africa", "most monitor lizards from Asia", "most monitor lizards from a"), each suggesting different tags and topics.]
139
Learning from N-grams
[Figure: the same picture, now with corpus counts attached to each overlapping N-gram.]
140
Fast but Principled Reasoning to Analyze Data
• Principled:
  - New models suited to the data
  - + new inference algorithms for those models
  - = draw appropriate conclusions from data
• Fast prediction:
  - Inference algorithms
  - + approximations trained to balance speed & accuracy
  - = 80% of the benefit at 20% of the cost
• Reusable frameworks for modeling & prediction
Dyna
A language for propagating and combining information
• Each idea takes a lot of labor to code up.
• We spend way too much "research" time building the parts that we already knew how to build.
  - Coding natural variants on existing models/algorithms
  - Hacking together existing data sources and algorithms
  - Extracting outputs
  - Tuning data structures, file formats, computation order, parallelization
What's in a knowledge base?
• Types
• Observed facts
• Derived facts
• Inference rules (declarative)
• Inference strategies (procedural)
Common architecture?
• There's not a single best way to represent uncertainty or combine knowledge.
What do numeric "weights" represent in a reasoning system?
• Probabilities (perhaps approximations or bounds)
• Intermediate quantities used to compute probabilities (in dynamic programming or message-passing)
• Feature values
• Potentials
• Feature weights & other parameters
• Priorities
• Distances, similarities
• Confidences
• Margins
• Activation levels
• Event or sample counts
• Regularization terms
• Losses, risks, rewards
• Partial derivatives
• ...
Common architecture?
• There's not a single best way to represent uncertainty or combine knowledge.
  - Different underlying probabilistic models
  - Different approximate inference/decision algorithms
    - Depends on domain properties, special structure, speed needs ...
  - Heterogeneous data, features, rules, proposal distributions ...
  - Need the ability to experiment, extend, and combine
• But all of the methods share the same computational needs.
Common architecture?
• There's not a single best way ...
• But all of the methods share the same needs.
  - Store data and permit it to be queried.
  - Fuse data: compute derived data using rules.
  - Propagate updates to data, parameters, or hypotheses.
  - Encapsulate data sources, both input data & analytics.
  - Sensitivity analysis (e.g., back-propagation for training).
  - Visualization of facts, changes, and provenance.
Common architecture?
2011 paper on encoding AI problems in Dyna:
• 2-3 lines: Dijkstra's algorithm (see the sketch below of what it computes)
• 4 lines: Feed-forward neural net
• 11 lines: Bigram language model (Good-Turing backoff smoothing)
• 6 lines: Arc-consistency constraint propagation
  - +6 lines: with backtracking search
  - +6 lines: with branch-and-bound
• 6 lines: Loopy belief propagation
• 3 lines: Probabilistic context-free parsing
  - +7 lines: PCFG rule weights via feature templates (toy example)
• 4 lines: Value computation in a Markov Decision Process
• 5 lines: Weighted edit distance
• 3 lines: Markov chain Monte Carlo (toy example)
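As an illustration of what the 2-3-line Dyna Dijkstra program computes, here is a Python sketch of the same agenda-style computation, path(B) min= path(A) + edge(A, B) with path(start) min= 0; the graph is made up.

```python
import heapq

def dijkstra(edges, start):
    """edges: dict mapping node -> list of (neighbor, weight)."""
    path = {start: 0.0}                      # best known value of path(node)
    agenda = [(0.0, start)]                  # items whose value may still propagate
    while agenda:
        d, a = heapq.heappop(agenda)
        if d > path.get(a, float("inf")):
            continue                         # stale agenda item
        for b, w in edges.get(a, []):
            if d + w < path.get(b, float("inf")):
                path[b] = d + w              # path(B) min= path(A) + edge(A, B)
                heapq.heappush(agenda, (path[b], b))
    return path

edges = {"s": [("a", 1.0), ("b", 4.0)], "a": [("b", 2.0), ("t", 6.0)], "b": [("t", 1.0)]}
print(dijkstra(edges, "s"))                  # {'s': 0.0, 'a': 1.0, 'b': 3.0, 't': 4.0}
```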
Common architecture?
• There's not a single best way ...
• But all of the methods share the same needs.
  - Store data and permit it to be queried.
  - Fuse data: compute derived data using rules.
  - Propagate updates to data, parameters, or hypotheses.
  - Encapsulate data sources, both input data & analytics.
  - Sensitivity analysis (e.g., back-propagation for training).
  - Visualization.
• And they benefit from the same optimizations.
  - Decide what is worth the time to compute (next).
  - Decide where to compute it (parallelism).
  - Decide what is worth the space to store (data, memos, indices).
  - Decide how to store it.


Dyna is not a probabilistic database, a
graphical model inference package,
FACTORIE, BLOG, Watson, a homebrew
evidence combination system, ...
It provides the common infrastructure for these.


That’s where “all” the implementation effort lies.
But does not commit to any specific data model,
probabilistic semantics, or inference strategy.
Summary (again)
Machine learning + linguistic structure.
Fashion statistical models that capture good intuitions about various kinds of linguistic structure. Develop efficient algorithms to apply these models to data. Be generic.
1. Principled Bayesian models of various interesting NLP domains.
   - Discover underlying structure with little supervision
   - Requires new learning and inference algorithms
2. Learn fast, accurate policies for structured prediction and large-scale relational reasoning.
3. Unified computational infrastructure for NLP and AI.
   - A declarative programming language that supports modularity
   - Backed by a searchable space of strategies & data structures