Constrained Conditional Models: Learning and Inference for Information Extraction and Natural Language Understanding
Dan Roth
Department of Computer Science
University of Illinois at Urbana-Champaign
ILPNLP Workshop @ NAACL-HLT, June 2009
With thanks to collaborators: Ming-Wei Chang, Dan Goldwasser, Vasin Punyakanok, Lev Ratinov, Nick Rizzolo, Mark Sammons, Ivan Titov, Scott Yih, Dav Zimak
Funding: ARDA, under the AQUAINT program; NSF: ITR IIS-0085836, ITR IIS-0428472, ITR IIS-0085980, SoD-HCER-0613885; a DOI grant under the Reflex program; DHS; DASH Optimization (Xpress-MP)
Page 1
Constrained Conditional Models (CCMs)
- Informally: everything that has to do with global constraints (and learning models).
- A bit more formally: we typically make decisions based on models such as
  argmax_y wᵀ f(y, x)
  With CCMs we make decisions based on models such as
  argmax_y wᵀ f(y, x) − Σ_{c∈C} ρ_c d(y, 1_C)
  (a small sketch of this decision rule follows below).
- This is a global inference problem (it can be solved in multiple ways).
- We do not dictate how models are learned, but we will discuss it and make suggestions.
- CCMs assign values to variables in the presence of, and guided by, constraints.
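To make the decision rule concrete, here is a minimal sketch (not from the talk; `feature_map`, `constraints`, and the brute-force enumeration are illustrative assumptions): it scores every candidate assignment by wᵀf(y, x) minus the weighted constraint penalties and returns the argmax. A real CCM would replace the enumeration with ILP or search.

```python
import itertools
import numpy as np

def ccm_argmax(x, w, feature_map, constraints, labels, n_vars):
    """Brute-force CCM inference: argmax_y  w . f(x, y)  -  sum_c rho_c * d(y, 1_C).

    feature_map(x, y) -> np.ndarray          global feature vector f(x, y)
    constraints: list of (rho_c, violation_c), violation_c(x, y) -> float d(y, 1_C)
    """
    best_y, best_score = None, float("-inf")
    for y in itertools.product(labels, repeat=n_vars):   # all candidate assignments
        score = float(np.dot(w, feature_map(x, y)))      # model score  w^T f(x, y)
        score -= sum(rho * viol(x, y) for rho, viol in constraints)  # constraint penalties
        if score > best_score:
            best_y, best_score = y, score
    return best_y, best_score
```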
Page 2
Constraints Driven Learning and Decision Making
Why constraints?
- The goal: building good NLP systems easily.
- We have prior knowledge at hand. How can we use it?
- Often knowledge can be injected directly and used to
  - improve decision making
  - guide learning
  - simplify the models we need to learn
How useful are constraints?
- Useful for supervised learning
- Useful for semi-supervised learning
- Sometimes more efficient than labeling data directly
Page 3
Make my day
Page 4
Learning and Inference
- Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  - E.g., structured output problems: multiple dependent output variables (the main playground for these methods so far).
- (Learned) models/classifiers for different sub-problems.
  - In some cases, not all local models can be learned simultaneously; key examples in NLP are Textual Entailment and QA. In these cases, constraints may appear only at evaluation time.
- Incorporate the models' information, along with prior knowledge/constraints, in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Page 5
Comprehension
(A process that maintains and updates a collection of propositions about the state of affairs.)
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in
England. He is the same person that you read about in the book, Winnie the
Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When
Chris was three years old, his father wrote a poem about him. The poem was
printed in a magazine for others to read. Mr. Robin then wrote a book. He
made up a fairy tale land where Chris lived. His friends were animals. There
was a bear called Winnie the Pooh. There was also an owl and a young pig,
called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin
made them come to life with his words. The places in the story were all near
Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to
read about Christopher Robin and his animal friends. Most people don't know
he is a real person who is grown now. He has written two books of his own.
They tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
This is an Inference Problem
Page 6
This Talk: Constrained Conditional Models
- A general inference framework that combines
  - learning conditional models with using declarative, expressive constraints
  - within a constrained optimization framework.
- Formulate a decision process as a constrained optimization problem:
  - break up a complex problem into a set of sub-problems and require the components' outcomes to be consistent modulo constraints.
- While we formulate the problem as an ILP problem, inference can be done in multiple ways:
  - search; sampling; dynamic programming; SAT; ILP.
- Has been shown useful in the context of many NLP problems: SRL, summarization, co-reference, information extraction, transliteration [Roth & Yih 04, 07; Punyakanok et al. 05, 08; Chang et al. 07, 08; Clarke & Lapata 06, 07; Denis & Baldridge 07; Goldwasser & Roth 08].
- The focus is on joint global inference; learning may or may not be joint.
  - Decomposing models is often beneficial.
- Here: the focus is on learning and inference for structured NLP problems.
Page 7
Outline
- Constrained Conditional Models
  - Motivation
  - Examples
- Training Paradigms: investigate ways for training models and combining constraints
  - Joint Learning and Inference vs. decoupling Learning & Inference
  - Training with Hard and Soft Constraints
  - Guiding Semi-Supervised Learning with Constraints
- Examples
  - Semantic Parsing
  - Information Extraction
  - Pipeline processes
Page 8
Pipeline
Raw Data → POS Tagging → Phrases → Semantic Entities → Relations (with stages such as Parsing, WSD, Semantic Role Labeling)
- Most problems are not single classification problems.
- Conceptually, pipelining is a crude approximation: interactions occur across levels, and downstream decisions often interact with previous decisions.
  - Leads to propagation of errors.
  - Occasionally, later-stage problems are easier, but they cannot correct earlier errors.
- But there are good reasons to use pipelines: putting everything in one basket may not be right.
- How about choosing some stages and thinking about them jointly?
Page 9
Inference with General Constraint Structure [Roth & Yih '04]
Recognizing Entities and Relations
Example sentence: "Dole 's wife, Elizabeth , is a native of N.C."
- Entities E1, E2, E3 each get local scores over {other, per, loc}; relations R12, R23 each get local scores over {irrelevant, spouse_of, born_in}. (The slide shows the score tables for each entity and relation.)
- The constraint structure is non-sequential; improvement over no inference: 2-5%.
Key components:
1) Write down a linear objective function:
   x* = argmax_x Σ_v c(x = v) · [x = v]
      = argmax_x c{E1 = per} · x{E1 = per} + c{E1 = loc} · x{E1 = loc} + … + c{R12 = spouse_of} · x{R12 = spouse_of} + …
2) Write down constraints as linear inequalities.
Subject to constraints.
Some questions: how to guide the global inference? Why not learn jointly?
Models could be learned separately; constraints may come up only at decision time.
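A minimal brute-force sketch of the entity/relation example above: the numeric scores are placeholders loosely based on the slide, and the specific coherence rules (spouse_of needs two persons, born_in a person and a location) are illustrative assumptions; the actual system solves the same objective as an ILP.

```python
import itertools

# Placeholder local scores (not an exact transcription of the slide's tables).
entity_scores = {
    "E1": {"other": 0.05, "per": 0.85, "loc": 0.10},
    "E2": {"other": 0.10, "per": 0.60, "loc": 0.30},
    "E3": {"other": 0.05, "per": 0.50, "loc": 0.45},
}
relation_scores = {
    ("E1", "E2"): {"irrelevant": 0.05, "spouse_of": 0.45, "born_in": 0.50},
    ("E2", "E3"): {"irrelevant": 0.10, "spouse_of": 0.45, "born_in": 0.45},
}

def coherent(ent, rel):
    """Constraints: spouse_of needs two persons; born_in needs a person and a location."""
    for (a, b), r in rel.items():
        if r == "spouse_of" and not (ent[a] == "per" and ent[b] == "per"):
            return False
        if r == "born_in" and not (ent[a] == "per" and ent[b] == "loc"):
            return False
    return True

best = None
for e_labels in itertools.product(["other", "per", "loc"], repeat=3):
    ent = dict(zip(["E1", "E2", "E3"], e_labels))
    for r_labels in itertools.product(["irrelevant", "spouse_of", "born_in"], repeat=2):
        rel = dict(zip(relation_scores, r_labels))
        if not coherent(ent, rel):
            continue                      # hard constraints filter illegal joint assignments
        score = sum(entity_scores[e][l] for e, l in ent.items())
        score += sum(relation_scores[p][r] for p, r in rel.items())
        if best is None or score > best[0]:
            best = (score, ent, rel)

print(best)   # the jointly best, constraint-respecting assignment
```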
Page 10
Problem Setting
- Random variables Y: y1, …, y8 (the slide shows a graph over these variables and the observations, with constraints C(y1, y4) and C(y2, y3, y6, y7, y8)).
- Conditional distributions P (learned by models/classifiers).
- Constraints C: any Boolean function defined over partial assignments (possibly with weights W).
- Goal: find the "best" assignment, the assignment that achieves the highest global performance.
- This is an Integer Programming problem:
  Y* = argmax_Y P_Y (+ W_C) subject to constraints C
Page 11
Formal Model
The objective combines:
- a weight vector for "local" models: a collection of classifiers, log-linear models (HMM, CRF), or a combination;
- a (soft) constraints component: a penalty for violating each constraint, scaled by how far y is from a "legal" assignment;
- subject to constraints.
How to solve? This is an Integer Linear Program; solving with ILP packages gives an exact solution, and search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
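For reference, the objective the slide annotates (shown as an image in the deck) is the same expression introduced earlier, with the two components labeled:

```latex
y^{*} \;=\; \arg\max_{y}\;
\underbrace{\mathbf{w}^{\top} f(x, y)}_{\text{weight vector for ``local'' models}}
\;-\;
\underbrace{\sum_{c \in C} \rho_c \, d\big(y, \mathbf{1}_{C}\big)}_{\text{(soft) constraints component}}
\qquad \text{subject to constraints.}
```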
Page 12
Example: Semantic Role Labeling
Who did what to whom, when, where, why, …
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
- A0: Leaver
- A1: Things left
- A2: Benefactor
- AM-LOC: Location
Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts. This has implications on training paradigms.
Constraints illustrated on the candidate spans: arguments may not overlap; if A2 is present, A1 must also be present.
Page 13
Semantic Role Labeling (2/2)
- PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations.
  - It adds a layer of generic semantic labels to Penn Treebank II.
  - (Almost) all the labels are on the constituents of the parse trees.
- Core arguments: A0-A5 and AA
  - different semantics for each verb
  - specified in the PropBank frame files
- 13 types of adjuncts labeled as AM-arg, where arg specifies the adjunct type.
Page 14
Algorithmic Approach
(Running example: "I left my nice pearls to her", with candidate argument spans shown as brackets on the slide.)
- Identify argument candidates (identify the vocabulary)
  - Pruning [Xue & Palmer, EMNLP'04]
  - Argument Identifier: binary classification (SNoW)
- Classify argument candidates
  - Argument Classifier: multi-class classification (SNoW)
- Inference (over the old and new vocabulary)
  - Use the estimated probability distribution given by the argument classifier
  - Use structural and linguistic constraints
  - Infer the optimal global output
Page 15
Inference
- The output of the argument classifier often violates some constraints, especially when the sentence is long.
- Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et al. 04; Roth & Yih 04, 05, 07]
- Input:
  - the probability estimation (by the argument classifier)
  - structural and linguistic constraints
- Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).
Page 16
Semantic Role Labeling (SRL)
I left my pearls to my daughter in my will .
(The slide shows, for each candidate argument, the classifier's score distribution over argument types.)
Page 17
Semantic Role Labeling (SRL)
I left my pearls to my daughter in my will .
(The same score table, illustrating the inference step over these scores.)
Page 18
Semantic Role Labeling (SRL)
I left my pearls to my daughter in my will .
One inference problem for each verb predicate.
(The slide again shows the candidate score table.)
Page 19
Integer Linear Programming Inference
- For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as type t.
- The goal is to maximize Σ_{i,t} score(a_i = t) · a_{i,t}, subject to the (linear) constraints.
- If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
- Note: the Constrained Conditional Model is completely decomposed during training.
Page 20
Constraints
Any Boolean rule can be encoded as a linear constraint (a solver-level sketch follows below).
- No duplicate argument classes:
  Σ_{a ∈ POTARG} x{a = A0} ≤ 1
- R-ARG: if there is an R-ARG phrase, there is an ARG phrase:
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG} x{a = A0} ≥ x{a2 = R-A0}
- C-ARG: if there is a C-ARG phrase, there is an ARG phrase before it:
  ∀ a2 ∈ POTARG:  Σ_{a ∈ POTARG, a before a2} x{a = A0} ≥ x{a2 = C-A0}
These constraints are universally quantified; LBJ allows a developer to encode rules in FOL, and they are compiled into linear inequalities automatically.
Many other possible constraints:
- Unique labels
- No overlapping or embedding
- Relations between the number of arguments; order constraints
- If the verb is of type A, no argument of type B
Joint inference can also be used to combine different SRL systems.
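As an illustration of how such Boolean rules become linear inequalities in an off-the-shelf solver, here is a small sketch using the PuLP package (an assumption made purely for illustration; the systems described in the talk used Xpress-MP), with hypothetical candidates and placeholder scores:

```python
# pip install pulp  -- illustrative only; not the solver used in the talk.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

candidates = ["a1", "a2", "a3"]                  # hypothetical argument candidates
labels = ["A0", "A1", "A2", "R-A0", "null"]
score = {(a, t): 0.1 for a in candidates for t in labels}   # placeholder scores
score[("a1", "A0")] = 0.9
score[("a2", "A1")] = 0.8
score[("a3", "R-A0")] = 0.7

prob = LpProblem("srl_inference", LpMaximize)
x = {(a, t): LpVariable(f"x_{a}_{t}".replace("-", "_"), cat=LpBinary)
     for a in candidates for t in labels}

# Objective: expected number of correct arguments,  sum_{i,t} score(a_i = t) * x_{i,t}
prob += lpSum(score[a, t] * x[a, t] for a in candidates for t in labels)

# Each candidate gets exactly one label (possibly "null").
for a in candidates:
    prob += lpSum(x[a, t] for t in labels) == 1

# No duplicate argument classes:  sum_a x{a = A0} <= 1
for t in ["A0", "A1", "A2"]:
    prob += lpSum(x[a, t] for a in candidates) <= 1

# R-ARG constraint: if some candidate is labeled R-A0, some candidate must be labeled A0.
for a2 in candidates:
    prob += x[a2, "R-A0"] <= lpSum(x[a, "A0"] for a in candidates)

prob.solve()
print({a: next(t for t in labels if value(x[a, t]) > 0.5) for a in candidates})
```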
Page 21
Learning Based Java (LBJ): http://L2R.cs.uiuc.edu/~cogcomp/software.php
A modeling language for Constrained Conditional Models
- Supports programming along with building learned models, high-level specification of constraints, and inference with constraints.
- Learning operator:
  - functions defined in terms of data
  - learning happens at "compile time"
- Integrated constraint language:
  - declarative, FOL-like syntax defines constraints in terms of your Java objects
- Compositionality:
  - use any function as a feature extractor
  - easily combine existing model specifications / learned models with each other
Page 22
Example: Semantic Role Labeling in LBJ [Rizzolo, Roth '07]
(The slide shows LBJ code; the LBJ site provides example code for NER, a POS tagger, etc.)
- Declarative, FOL-style constraints are written in terms of functions applied to Java objects.
- Inference produces new functions that respect the constraints.
Page 23
Semantic Role Labeling
(Screenshot from a CCG demo: http://L2R.cs.uiuc.edu/~cogcomp)
Semantic parsing reveals several relations in the sentence along with their arguments.
- This approach produces a very good semantic parser: F1 ≈ 90%.
- Easy and fast: ~7 sentences/second (using Xpress-MP).
- Top-ranked system in the CoNLL'05 shared task; the key difference is the inference.
Page 24
Textual Entailment
Is it true that…? (Textual Entailment)
"Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year"
⇒ Yahoo acquired Overture
   Overture is a search company
   Google is a search company
   Google owns Overture
   ……
Components involved: Semantic Role Labeling [Punyakanok et al. '05, '08]; phrasal verb paraphrasing [Connor & Roth '07]; inference for entailment [Braz et al. '05, '07]; entity matching [Li et al., AAAI'04, NAACL'04].
Page 25
Outline
- Constrained Conditional Models
  - Motivation
  - Examples
- Training Paradigms: Investigate ways for training models and combining constraints
  - Joint Learning and Inference vs. decoupling Learning & Inference
  - Training with Hard and Soft Constraints
  - Guiding Semi-Supervised Learning with Constraints
- Examples
  - Semantic Parsing
  - Information Extraction
  - Pipeline processes
Page 26
Training Paradigms that Support Global Inference
- Algorithmic approach: incorporating general constraints
  - Allow both statistical and expressive declarative constraints [ICML'05]
  - Allow non-sequential constraints (generally difficult) [CoNLL'04]
- Coupling vs. decoupling training and inference
  - Incorporating global constraints is important, but should it be done only at evaluation time or also at training time?
  - How to decompose the objective function and train in parts?
  - Issues related to: modularity, efficiency and performance, availability of training data, and problem-specific considerations.
Page 27
Training in the Presence of Constraints
General training paradigm (SRL case): decompose the model, and decompose the model from the constraints.
- First term: learning from data (could be further decomposed).
- Second term: guiding the model by constraints.
- One can choose whether the constraints' weights are trained, when and how, or taken into account only in evaluation.
Page 28
Comparing Training Methods
- Option 1: Learning + Inference (with Constraints)
  - Ignore constraints during training.
- Option 2: Inference (with Constraints) Based Training
  - Consider constraints during training.
- In both cases: global decision making with constraints.
- Question: isn't Option 2 always better?
  - Not so simple… Next, the "local model story".
Page 29
Training Methods
(Cartoon: each model can be more complex and may have a view on a set of output variables; the figure shows local models f1(x), …, f5(x) over outputs y1, …, y5 and inputs x1, …, x7.)
- Learning + Inference (L+I): learn the models independently.
- Inference Based Training (IBT): learn all models together.
- Intuition: learning with constraints may make learning more difficult.
Page 30
Training with Constraints
Example: perceptron-based global learning.
- Take the true global labeling Y, e.g., (-1, 1, -1, -1, 1).
- The local predictions Y' may differ, e.g., (-1, 1, 1, -1, 1).
- Apply the constraints, compare the constrained prediction to the gold structure, and update.
Which one is better? When and why?
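A minimal sketch of the IBT idea in code (the names and signatures are illustrative, not from the talk): a structured perceptron that calls inference with constraints inside the training loop and updates on the whole structure. L+I, by contrast, trains each local classifier on its own labels and uses constrained inference only at test time.

```python
import numpy as np

def ibt_perceptron(data, feature_map, constrained_argmax, dim, epochs=10):
    """Inference Based Training (IBT) sketch: a structured perceptron.

    data: iterable of (x, y_gold) pairs
    feature_map(x, y) -> np.ndarray  global feature vector
    constrained_argmax(x, w) -> y    inference with constraints (e.g., ILP or search)
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = constrained_argmax(x, w)        # global, constraint-respecting prediction
            if y_pred != y_gold:
                w += feature_map(x, y_gold) - feature_map(x, y_pred)
    return w
```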
Page 31
Claims [Punyakanok et al., IJCAI 2005]
- When the local models are "easy" to learn, L+I outperforms IBT.
  - In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER).
- Only when the local problems become difficult to solve in isolation does IBT outperform L+I, but it needs a larger number of training examples.
- L+I: cheaper computationally; modular.
- IBT is better in the limit, and in other extreme cases.
- Other training paradigms are possible, e.g., pipeline-like sequential models [Roth, Small, Titov: AI&Stats'09]: identify a preferred ordering among components and learn the k-th model jointly with the previously learned models.
Page 32
Bound Prediction
L+I vs. IBT: the more identifiable the individual problems are, the better the overall performance is with L+I.
- Local: ε ≤ ε_opt + ( (d·log m + log 1/δ) / m )^(1/2)
- Global: ε ≤ 0 + ( (c·d·log m + c²·d + log 1/δ) / m )^(1/2), where c indicates the hardness of the problem.
(The slide plots the bounds and simulated data for ε_opt = 0, 0.1, 0.2.)
Page 33
Relative Merits: SRL
(The figure plots performance against the difficulty of the learning problem, measured by the number of features, from easy to hard.)
- In the easy, realistic range, L+I is better.
- When the problem is artificially made harder, the trade-off is clearer.
Page 34
Comparing Training Methods (Cont.)
Decompose the model (SRL case); decompose the model from the constraints.
- Local models (trained independently) vs. structured models
  - In many cases, structured models might be better due to expressivity.
  - But what if we use constraints?
- Local models + constraints vs. structured models + constraints
  - Hard to tell: constraints are expressive.
  - For tractability reasons, structured models have less expressivity than the use of constraints.
  - Local can be better, because local models are easier to learn.
Page 35
Example: CRFs are CCMs (but you can do better)
Consider a common model for sequential inference: HMM/CRF.
- Inference in this model is done via the Viterbi algorithm (the slide shows the label lattice over y1, …, y5 and x1, …, x5, with states A, B, C).
- Viterbi is a special case of linear-programming-based inference:
  - Viterbi is a shortest-path problem, which is an LP with a canonical matrix that is totally unimodular; therefore, you get integrality constraints for free.
- One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix
  - e.g., no value can appear twice; a specific value must appear at least once; A→B
  and run the inference as an ILP inference.
- Learn a rather simple model; make decisions with a more expressive model.
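For concreteness, here is a standard Viterbi decoder sketch (the log-space score arrays are illustrative assumptions); the point of the slide is that the same learned scores can instead be decoded as an ILP once non-sequential constraints, such as "label B appears at most once", are added.

```python
import numpy as np

def viterbi(emission, transition, init):
    """Viterbi decoding for a sequence model (HMM/CRF scores in log space).

    emission:   (T, K) per-position label scores
    transition: (K, K), transition[i, j] = score of label i -> label j
    init:       (K,)   initial label scores
    """
    T, K = emission.shape
    score = init + emission[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition + emission[t][None, :]   # (K, K) candidate scores
        back[t] = cand.argmax(axis=0)                               # best predecessor per label
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# A non-sequential constraint such as "label B appears at most once" cannot be folded
# into this DP directly; in the CCM view one keeps the same learned scores but decodes
# with ILP (or constrained search) over the same path variables.
```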
Page 36
Example: Semantic Role Labeling Revisited
(The slide shows the same label lattice, with states A, B, C, contrasting two model families.)
- Sequential models: Conditional Random Field; global perceptron. Training is sentence-based; testing finds the shortest path, with constraints.
- Local models: logistic regression; local averaged perceptron. Training is token-based; testing finds the best assignment locally, with constraints.
Page 37
Which Model is Better? Semantic Role Labeling
- Experiments on SRL [Roth and Yih, ICML 2005]
- Story: inject constraints into conditional random field models.

  Model                      | No constraints | With constraints | Training time
  CRF (baseline)             | 66.46          | 71.94            | 48
  CRF-D (sequential, L+I)    | -              | 73.91            | 38
  CRF-IBT (sequential, IBT)  | 69.14          | 69.82            | 145
  Avg. P (local, L+I)        | 58.15          | 74.49            | 0.8

- Without constraints: sequential models are better than local models!
- With constraints: local models are now better than sequential models!
Page 38
Summary: Training Methods
- Many choices for training a CCM:
  - Learning + Inference (training without constraints)
  - Inference Based Learning (training with constraints)
  - Model decomposition
- Advantages of L+I:
  - requires fewer training examples
  - more efficient; most of the time, better performance
  - modularity; easier to incorporate already learned models
- Advantages of IBT:
  - better in the limit
  - better when there are strong interactions among the y's
- Learn a rather simple model; make decisions with a more expressive model.
Page 39
Outline
- Constrained Conditional Models
  - Motivation
  - Examples
- Training Paradigms: Investigate ways for training models and combining constraints
  - Joint Learning and Inference vs. decoupling Learning & Inference
  - Training with Hard and Soft Constraints
  - Guiding Semi-Supervised Learning with Constraints
- Examples
  - Semantic Parsing
  - Information Extraction
  - Pipeline processes
Page 40
Constrained Conditional Model: Soft Constraints
The (soft) constraints component adds a constraint-violation penalty, scaled by how far y is from a "legal" assignment, subject to constraints.
Questions:
(1) Why use soft constraints?
(2) How to model the "degree of violation"?
(3) How to solve? This is an Integer Linear Program; solving with ILP packages gives an exact solution, and search techniques are also possible.
(4) How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?
Page 41
(1) Why Are Soft Constraints Important?
- Some constraints may be violated by the data.
- Even when the gold data violates no constraints, the model may prefer illegal solutions.
  - If all solutions considered by the model violate constraints, we still want to rank solutions based on the level of constraint violation.
  - Important when beam search is used: rather than eliminating illegal assignments, re-rank them.
- Working with soft constraints [Chang et al., ACL'07]:
  - need to define the degree of violation (may be problem specific; one possible choice is sketched below);
  - need to assign penalties for constraints.
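One possible, problem-specific way to define the degree of violation d(y, 1_C) is to count the violated groundings of the constraint. The sketch below is illustrative (the names are made up), using the "state transitions must occur on punctuation marks" constraint that appears in the citation example later in the talk.

```python
def degree_of_violation(y, groundings):
    """One possible definition of d(y, 1_C): the number of constraint
    instantiations ("groundings") that the assignment y violates.

    groundings: list of predicates g(y) -> bool, True if that instantiation is satisfied.
    """
    return sum(1 for g in groundings if not g(y))

# Hypothetical example: "state transitions must occur on punctuation marks" --
# one grounding per adjacent token pair; d(y, 1_C) counts the offending transitions.
tokens = ["Lars", "Ole", "Andersen", ".", "Program", "analysis"]
labels = ["AUTHOR", "AUTHOR", "AUTHOR", "AUTHOR", "TITLE", "TITLE"]
groundings = [
    (lambda y, i=i: y[i] == y[i + 1] or tokens[i] in {".", ","})
    for i in range(len(tokens) - 1)
]
print(degree_of_violation(labels, groundings))   # 0 here: the only label switch follows "."
```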
Page 42
Information Extraction without Prior Knowledge
Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
Prediction result of a trained HMM:
[AUTHOR] Lars Ole Andersen . Program analysis and
[TITLE] specialization for the
[EDITOR] C
[BOOKTITLE] Programming language
[TECH-REPORT] . PhD thesis .
[INSTITUTION] DIKU , University of Copenhagen , May
[DATE] 1994 .
This violates lots of natural constraints!
Page 43
Examples of Constraints
- Each field must be a consecutive list of words and can appear at most once in a citation.
- State transitions must occur on punctuation marks.
- The citation can only start with AUTHOR or EDITOR.
- The words pp., pages correspond to PAGE.
- Four digits starting with 20xx and 19xx are DATE.
- Quotations can appear only in TITLE.
- …
These are easy-to-express pieces of "knowledge": non-propositional, and they may use quantifiers.
Page 44
Information Extraction with Constraints
Adding constraints, we get correct results, without changing the model:
[AUTHOR] Lars Ole Andersen .
[TITLE] Program analysis and specialization for the C Programming language .
[TECH-REPORT] PhD thesis .
[INSTITUTION] DIKU , University of Copenhagen ,
[DATE] May, 1994 .
Page 45
Hard Constraints vs. Weighted Constraints
- Hard constraints: appropriate when the constraints are close to perfect.
- Weighted (soft) constraints: appropriate when the labeled data might not follow the constraints.
Page 46
Training with Soft Constraints
- Need to figure out the penalties as well…
- Option 1: Learning + Inference (with Constraints)
  - Learn the weights and penalties separately
  - Penalty(c) = -log P(C is violated)
- Option 2: Inference (with Constraints) Based Training
  - Learn the weights and penalties together
- The trade-off between L+I and IBT is similar to what we saw earlier.
Page 47
Inference Based Training with Soft Constraints
Example: perceptron-style training that updates the constraint penalties as well.
For each iteration
    For each (X, Y_GOLD) in the training data
        Y_PRED = the assignment returned by constrained inference with the current (λ, ρ)
        If Y_PRED != Y_GOLD
            λ = λ + F(X, Y_GOLD) - F(X, Y_PRED)
            ρ_I = ρ_I + d(Y_GOLD, 1_{C_I}(X)) - d(Y_PRED, 1_{C_I}(X)),  for I = 1, …
        endif
    endfor
endfor
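A hedged Python rendering of the pseudocode above (the function names and signatures are illustrative; `inference` stands for constrained inference with the current parameters):

```python
import numpy as np

def ibt_soft_constraints(data, F, d, inference, dim, n_constraints, iters=10):
    """Sketch of inference-based training with soft constraints: the model weights
    and the constraint penalties are updated together, mirroring the slide.

    F(x, y) -> np.ndarray            global feature vector
    d(x, y) -> np.ndarray            per-constraint violation degrees d(y, 1_{C_i}(x))
    inference(x, w, rho) -> y        constrained inference with the current parameters
    """
    w = np.zeros(dim)
    rho = np.zeros(n_constraints)
    for _ in range(iters):
        for x, y_gold in data:
            y_pred = inference(x, w, rho)
            if y_pred != y_gold:
                w += F(x, y_gold) - F(x, y_pred)
                # Penalty update, with the sign convention used on the slide.
                rho += d(x, y_gold) - d(x, y_pred)
    return w, rho
```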
Page 48
L+I vs. IBT for Soft Constraints
- Test on citation recognition:
  - L+I: HMM + weighted constraints
  - IBT: Perceptron + weighted constraints
  - Same feature set
- Results (with vs. without constraints):
  - With few labeled examples, HMM > Perceptron; the factored (L+I) model is better, and the gap is more significant with a small number of examples.
  - With many labeled examples, Perceptron > HMM.
Page 49
Outline
- Constrained Conditional Models
  - Motivation
  - Examples
- Training Paradigms: Investigate ways for training models and combining constraints
  - Joint Learning and Inference vs. decoupling Learning & Inference
  - Training with Hard and Soft Constraints
  - Guiding Semi-Supervised Learning with Constraints
- Examples
  - Semantic Parsing
  - Information Extraction
  - Pipeline processes
Page 50
Outline
- Constrained Conditional Models
  - Motivation
  - Examples
- Training Paradigms: Investigate ways for training models and combining constraints
  - Joint Learning and Inference vs. decoupling Learning & Inference
  - Guiding Semi-Supervised Learning with Constraints
- Features vs. Constraints
  - Hard and Soft Constraints
- Examples
  - Semantic Parsing
  - Information Extraction
  - Pipeline processes
Page 51
Constraints as a Way to Encode Prior Knowledge
- Consider encoding the knowledge that: entities of type A and B cannot occur simultaneously in a sentence.
- The "feature" way:
  - requires larger models
  - needs more training data
- The constraints way: an effective way to inject knowledge
  - keeps the model simple; add expressive constraints directly
  - a small set of constraints
  - allows for decision-time incorporation of constraints
- We can use constraints as a way to replace training data.
Page 52
Guiding Semi-Supervised Learning with Constraints
- In traditional semi-supervised learning, the model can drift away from the correct one.
- Constraints can be used to generate better training data:
  - at decision time, to bias the objective function towards favoring constraint satisfaction;
  - at training time, to improve the labeling of unlabeled data (and thus improve the model).
(The figure shows the loop: model → decision-time constraints, and constraints applied to unlabeled data → back into the model.)
Page 53
Semi-Supervised Learning with Constraints [Chang, Ratinov, Roth, ACL'07; ICML'08]
A supervised learning algorithm parameterized by λ, with inference-based augmentation of the training set (feedback through inference with constraints):

λ = learn(T)
For N iterations do
    T = ∅
    For each x in the unlabeled dataset
        {y1, …, yK} = InferenceWithConstraints(x, C, λ)
        T = T ∪ {(x, yi)}, i = 1…K
    λ = γ·λ + (1 - γ)·learn(T)      // learn from the new training data; weigh the supervised and unsupervised models
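A sketch of the same loop in Python (illustrative names; the parameters are assumed to be plain vectors so that the γ-weighted interpolation is a simple blend):

```python
def codl(labeled, unlabeled, learn, inference_with_constraints, K=1, N=10, gamma=0.9):
    """Constraint-driven learning sketch, following the pseudocode above.

    learn(T) -> parameter vector (np.ndarray assumed, so interpolation is a blend)
    inference_with_constraints(x, model) -> the K best constraint-respecting labelings of x
    """
    model = learn(labeled)
    for _ in range(N):
        T = []                                           # re-label the unlabeled data
        for x in unlabeled:
            for y in inference_with_constraints(x, model)[:K]:
                T.append((x, y))                         # constraint-corrected self-labels
        model = gamma * model + (1 - gamma) * learn(T)   # weigh supervised vs. new model
    return model
```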
Page 54
Value of Constraints in Semi-Supervised Learning
(The figure plots the objective function against the number of available labeled examples, comparing learning without constraints on 300 examples with learning with 10 constraints.)
- Constraints are used to bootstrap a semi-supervised learner.
- A poor model + constraints is used to annotate unlabeled data, which in turn is used to keep training the model.
- Factored model.
Page 55
Constraints in a Hidden Layer
- Single-output problem: only one output y1 over the inputs x1, …, x7 (it is hard to find constraints on a single output!).
- Intuition: introduce structural hidden variables.
Page 56
Adding Constraints Through Hidden Variables
- The same single-output problem, now with hidden variables: local functions f1, …, f5 over the inputs x1, …, x7 feed the single output Y.
- Use constraints to capture the dependencies among the hidden variables.
Page 57
Learning a Good Feature Representation for Discriminative Transliteration
Example: is (איטליה, Italy) a transliteration pair? The features are character-alignment edges between "I t a l y" and "א י ט ל י ה", and the classifier answers Yes/No.
- Learning the feature representation is a structured learning problem: features are the graph edges, and the problem is choosing the optimal subset.
- There are many constraints on the legitimacy of the active feature representation.
- Formalize the problem as a constrained optimization problem:
  X* = argmax_X Σ_{i∈S, j∈T} w_ij x_ij
  subject to: one-to-one mapping; non-crossing edges; length-difference restriction; language-specific constraints.
- A successful solution depends on:
  - learning a good objective function (an iterative unsupervised learning algorithm);
  - a good initial objective function (a Romanization table).
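A small sketch of the alignment step as an ILP (again using PuLP purely as an illustration; the edge weights are placeholders, and only the one-to-one and non-crossing constraints from the slide are encoded):

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, value

source = list("Italy")                      # characters i in S
target = list("איטליה")                      # characters j in T (Hebrew, from the slide)
w = {(i, j): 0.1 for i in range(len(source)) for j in range(len(target))}  # placeholder weights

prob = LpProblem("translit_alignment", LpMaximize)
x = {e: LpVariable(f"x_{e[0]}_{e[1]}", cat=LpBinary) for e in w}

prob += lpSum(w[e] * x[e] for e in w)       # X* = argmax sum_{i in S, j in T} w_ij x_ij

for i in range(len(source)):                # one-to-one: each source character used at most once
    prob += lpSum(x[i, j] for j in range(len(target))) <= 1
for j in range(len(target)):                # ... and each target character used at most once
    prob += lpSum(x[i, j] for i in range(len(source))) <= 1

for i in range(len(source)):                # non-crossing: edges (i, j) and (k, l) with
    for j in range(len(target)):            # i < k and l < j cannot both be active
        for k in range(i + 1, len(source)):
            for l in range(j):
                prob += x[i, j] + x[k, l] <= 1

prob.solve()
print([e for e in w if value(x[e]) > 0.5])  # the selected (active) feature edges
```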
Page 58
Iterative Objective Function Learning
Loop: generate features → inference → prediction → update the weight vector → (back to) generate features. The initial objective function comes from the Romanization table, and in each round labels are predicted for all word pairs in training.

Results:
  Language pair          | UCDL | Prev. Sys
  English-Russian (ACC)  | 73   | 63
  English-Hebrew (MRR)   | 89.9 | 51
Page 59
Summary: Constrained Conditional Models
(The figure shows a conditional Markov random field over y1, …, y8 next to a constraints network over the same variables.)
y* = argmax_y Σ_i w_i φ_i(x, y) - Σ_i ρ_i d_{C_i}(x, y)
- Left term: linear objective functions; typically φ(x, y) will be local functions, or φ(x, y) = φ(x).
- Right term: expressive constraints over the output variables; soft, weighted constraints, specified declaratively as FOL formulae.
- Clearly, there is a joint probability distribution that represents this mixed model.
- Key difference from MLNs: MLNs provide a concise definition of a model, but of the whole joint distribution.
- We would like to: learn a simple model (or several simple models), and make decisions with respect to a complex model.
Page 60
Conclusion
- Constrained Conditional Models combine learning conditional models with using declarative, expressive constraints, within a constrained optimization framework.
- Use constraints! The framework supports:
  - a clean way of incorporating constraints to bias and improve decisions of supervised learning models;
  - significant success on several NLP and IE tasks (often with ILP);
  - a clean way to use (declarative) prior knowledge to guide semi-supervised learning.
- The training protocol matters; more work is needed here.
- LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/~cogcomp — a modeling language for Constrained Conditional Models. It supports programming along with building learned models, high-level specification of constraints, and inference with constraints.
Page 61
Nice to Meet You
Page 62