SRL via Generalized Inference
Multiclass Classification in NLP
Named Entity Recognition
Label people, locations, and organizations in a sentence
[PER Sam Houston], born in [LOC Virginia], was a member of the [ORG US Congress].
Decompose into sub-problems
Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1)
Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → None (0)
Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → LOC (2)
Many problems in NLP are decomposed this way
Disambiguation tasks
POS Tagging
Word-sense disambiguation
Verb Classification
Semantic-Role Labeling
Page 1
Outline
Multi-Categorical Classification Tasks
Decomposition Approaches
Constraint Classification
example: Semantic Role Labeling (SRL)
Unifies learning of multi-categorical classifiers
Structured-Output Learning
revisit SRL
Decomposition versus Constraint Classification
Goal:
Discuss multi-class and structured output from the same
perspective.
Discuss similarities and differences
Page 2
Multi-Categorical Output Tasks
Multi-class Classification (y ∈ {1,...,K})
character recognition (‘6’)
document classification (‘homepage’)
Multi-label Classification (y ⊆ {1,...,K})
document classification (‘(homepage, facultypage)’)
Category Ranking (y = a ranking over {1,...,K})
user preference (‘(love > like > hate)’)
document classification (‘homepage > facultypage > sports’)
Hierarchical Classification (y ⊆ {1,...,K})
cohere with class hierarchy
place document into index where ‘soccer’ is-a ‘sport’
Page 3
(more) Multi-Categorical Output Tasks
Sequential Prediction (y ∈ {1,...,K}⁺)
e.g. POS tagging (‘(N V N N A)’)
“This is a sentence.” → D V D N
e.g. phrase identification
Many labels: K^L for a sentence of length L
Structured Output Prediction (y ∈ C({1,...,K}⁺))
e.g. parse tree, multi-level phrase identification
e.g. sequential prediction
Constrained by
domain, problem, data, background knowledge, etc...
Page 4
Semantic Role Labeling
A Structured-Output Problem
For each verb in a sentence
1. Identify all constituents that fill a semantic role
2. Determine their roles
• Core Arguments, e.g., Agent, Patient or Instrument
• Their adjuncts, e.g., Locative, Temporal or Manner
[A0 I] left [A1 my pearls] [A2 to my daughter-in-law] [AM-LOC in my will].
A0: leaver, A1: thing left, A2: benefactor, AM-LOC: location
Page 5
Semantic Role Labeling
[A0 I] left [A1 my pearls] [A2 to my daughter-in-law] [AM-LOC in my will].
Many possible valid outputs
Many possible invalid outputs
Page 6
Structured Output Problems
Multi-Class
View y=4 as (y1,...,yk) = ( 0 0 0 1 0 0 0 )
The output is restricted by “Exactly one of yi=1”
Learn f1(x),..,fk(x)
Sequence Prediction
e.g. POS tagging: x = (My name is Dav) y = (Pr,N,V,N)
e.g. restriction: “Every sentence must have a verb”
Structured Output
Arbitrary global constraints
Local functions do not have access to global constraints!
Goal:
Discuss multi-class and structured output from the same perspective.
Discuss similarities and differences
Page 7
Transform the sub-problems
Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1)
Transform each problem to a feature vector
Sam Houston, born in Virginia
(Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ... )
( 0 , 0 , 1 , 0 , 1 , 1 , ... )
Transform each label to a class label
PER → 1
LOC → 2
ORG → 3
? → 0
Input: {0,1}^d or R^d
Output: {0,1,2,3,...,k}
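As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this transformation; the feature names are placeholders mirroring the toy example above.

```python
# Minimal sketch: each sub-problem becomes a binary feature vector and each
# label becomes an integer class. Feature names are illustrative only.

FEATURES = ["BOB-", "JOHN-", "SAM-HOUSTON", "HAPPY", "-BORN", "--BORN"]
LABELS = {"?": 0, "PER": 1, "LOC": 2, "ORG": 3}

def featurize(active):
    """Map a set of active feature names to a {0,1}^d vector."""
    return [1 if f in active else 0 for f in FEATURES]

# "Sam Houston" in "Sam Houston, born in Virginia ..." -> class PER (1)
x = featurize({"SAM-HOUSTON", "-BORN", "--BORN"})
y = LABELS["PER"]
print(x, y)   # [0, 0, 1, 0, 1, 1] 1
```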
Page 8
Solving multiclass with binary learning
Multiclass classifier
Function
f : R^d → {1,2,3,...,k}
Decompose into binary problems
Not always possible to learn
No theoretical justification (unless the problem is easy)
Page 9
The Real MultiClass Problem
General framework
Extend binary algorithms
Theoretically justified
Provably correct
Generalizes well
Verified Experimentally
Naturally extends binary classification algorithms to the multiclass setting
e.g. Linear binary separation induces linear boundaries in
multiclass setting
Page 10
Multi-Class over Linear Functions
– One versus all (OvA)
– All versus all (AvA)
– Direct winner-take-all (D-WTA)
Page 11
WTA over linear functions
Assume examples are generated from a winner-take-all function
y = argmax_i w_i·x + t_i
w_i, x ∈ R^n, t_i ∈ R
• Note: Voronoi diagrams are WTA functions
argmin_i ||c_i − x|| = argmax_i c_i·x − ||c_i||²/2
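As an aside (not from the slides), the sketch below checks this WTA/Voronoi equivalence numerically, assuming w_i = c_i and t_i = −||c_i||²/2; the centers and test point are arbitrary toy values.

```python
import numpy as np

def wta_predict(W, t, x):
    # winner-take-all over linear functions: y = argmax_i (w_i . x + t_i)
    return int(np.argmax(W @ x + t))

centers = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])
x = np.array([2.5, 1.2])

# nearest Voronoi center vs. WTA with w_i = c_i, t_i = -||c_i||^2 / 2
nearest = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
as_wta = wta_predict(centers, -0.5 * np.sum(centers ** 2, axis=1), x)
assert nearest == as_wta
```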
Page 12
Learning via One-Versus-All (OvA)
Assumption
Find v_r, v_b, v_g, v_y ∈ R^n such that
v_r·x > 0 iff y = red
v_b·x > 0 iff y = blue
v_g·x > 0 iff y = green
v_y·x > 0 iff y = yellow
Classifier: f(x) = argmax_i v_i·x
H = R^{kn}
(Figure: the individual classifiers and the resulting decision regions)
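A minimal sketch (not the lecture's code) of OvA training with k independent binary perceptrons as the base learner, evaluated with argmax_i v_i·x:

```python
import numpy as np

def train_ova(X, y, k, epochs=10):
    """Train one binary perceptron per class (class i vs. the rest)."""
    V = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, y):
            for i in range(k):
                target = 1 if label == i else -1
                if target * (V[i] @ x) <= 0:   # mistake-driven update
                    V[i] += target * x
    return V

def predict(V, x):
    return int(np.argmax(V @ x))               # f(x) = argmax_i v_i . x
```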
Page 13
Learning via All-Versus-All (AvA)
Assumption
Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that
v_rb·x > 0 if y = red, < 0 if y = blue
v_rg·x > 0 if y = red, < 0 if y = green
... (for all pairs)
H = R^{k(k−1)d/2}
How do we classify?
(Figure: the individual pairwise classifiers and the resulting decision regions)
Page 14
Classifying with AvA
Tree
Majority Vote
e.g. 1 red, 2 yellow, 2 green → ?
Tournament
All of these are post-learning decision rules and can give inconsistent results.
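A small illustrative sketch of the majority-vote rule over pairwise AvA classifiers (the toy separators below are hand-set placeholders):

```python
import numpy as np

def ava_majority_vote(pairwise, x, k):
    """pairwise[(i, j)] = v_ij separates class i (positive) from class j."""
    votes = np.zeros(k, dtype=int)
    for (i, j), v in pairwise.items():
        votes[i if v @ x > 0 else j] += 1
    return int(np.argmax(votes))               # ties broken by lowest class index

# toy usage: k = 3 hand-set pairwise separators in R^2
pairwise = {(0, 1): np.array([1.0, -1.0]),
            (0, 2): np.array([1.0, 0.5]),
            (1, 2): np.array([-0.5, 1.0])}
print(ava_majority_vote(pairwise, np.array([2.0, 1.0]), 3))   # -> 0
```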
Page 15
Summary (1): Learning Binary Classifiers
On-line (Perceptron, Winnow):
mistake bounded
generalizes well (VC-dim)
works well in practice
SVM:
well motivated to maximize margin
generalizes well
works well in practice
Boosting, Neural Networks, etc...
Page 16
From Binary to Multi-categorical
Decompose multi-categorical problems
into multiple (independent) binary problems
Multi-class: OvA, AvA, ECOC, DT, etc...
Multi-label: reduce to multi-class
Category Ranking: reduce, or use regression
Sequence Prediction:
Reduce to Multi-class
part/alphabet based decompositions
Structured Output:
learn parts of output based on local information!!!
Page 17
Problems with Decompositions
Learning optimizes over local metrics
Poor global performance
Poor decomposition → poor performance
What is the metric?
We don’t care about the performance of the local classifiers
Difficult local problems
Irrelevant local problems
Not clear how to decompose all Multi-category
problems
Page 18
Multi-class OvA Decomposition: a Linear
Representation
Hypothesis: h(x) = argmax_i v_i·x
Decomposition
Learning: One-versus-all (OvA)
Each class represented by a linear function v_i·x
For each class i: v_i·x > 0 iff i = y
General Case
Each class represented by a function fi(x) > 0
Page 19
Learning via One-Versus-All (OvA)
Assumption
Classifier: f(x) = argmax_i v_i·x
(Figure: the individual classifiers)
OvA learning: find v_i such that v_i·x > 0 iff y = i
OvA is fine only if the data is OvA separable!
A linear classifier can represent this function!
(Voronoi) argmin_i d(c_i, x)  ⇔  (WTA) argmax_i c_i·x + d_i
Page 20
Other Issues we Mentioned
Error Correcting Output Codes
Another (class of) decomposition
Difficulty: how to make sure that the resulting problems are separable.
Commented on the advantage of All vs. All when working with the
dual space (e.g., kernels)
Page 21
Example: SNoW Multi-class Classifier
How do we train? How do we evaluate?
(Figure: target nodes (each an LTU) connected to feature nodes by weighted edges (weight vectors))
SNoW only represents the targets and weighted edges
Page 22
Winnow: Extensions
Winnow learns monotone boolean functions
To learn non-monotone boolean functions:
For each variable x, introduce x’ = ¬x
Learn monotone functions over 2n variables
To learn functions with real valued inputs:
“Balanced Winnow”
2 weights per variable; effective weight is the difference
Update rule: if sign[(w⁺ − w⁻)·x] ≠ y, then w⁺_i ← w⁺_i · r^(y·x_i) and w⁻_i ← w⁻_i · r^(−y·x_i)
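A minimal sketch of this Balanced Winnow update (assuming labels y ∈ {−1,+1}, non-negative features, and a promotion rate r > 1):

```python
import numpy as np

def balanced_winnow_update(w_plus, w_minus, x, y, r=1.5):
    """Mistake-driven multiplicative update; effective weight is w_plus - w_minus."""
    if np.sign((w_plus - w_minus) @ x) != y:
        w_plus *= r ** (y * x)       # promote/demote coordinate-wise
        w_minus *= r ** (-y * x)
    return w_plus, w_minus
```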
Page 23
An Intuition: Balanced Winnow
In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples.
Typically, we train each node separately (my/not my example).
Rather, given an example we could say: this is more a + example
than a – example.
If sign[(w⁺ − w⁻)·x] ≠ y, then w⁺_i ← w⁺_i · r^(y·x_i) and w⁻_i ← w⁻_i · r^(−y·x_i)
We compared the activation of the different target nodes (classifiers)
on a given example. (This example is more class + than class -)
Page 24
Constraint Classification
Can be viewed as a generalization of the balanced Winnow to
the multi-class case
Unifies multi-class, multi-label, category-ranking
Reduces learning to a single binary learning task
Captures theoretical properties of binary algorithm
Experimentally verified
Naturally extends Perceptron, SVM, etc...
Do all of this by representing labels as a set of constraints or
preferences among output labels.
Page 25
Multi-category to Constraint
Classification
Multiclass: (x, A) → (x, (A>B, A>C, A>D))
Multilabel: (x, (A, B)) → (x, (A>C, A>D, B>C, B>D))
Label Ranking: (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))
Examples (x, y) with y ∈ S_k
S_k: partial order over class labels {1,...,k}
defines a “preference” relation ( > ) for class labeling
Constraint Classifier
h: X → S_k
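A short illustrative sketch of how the three label types map to sets of preference constraints, with classes encoded as integers (A=0, B=1, ...):

```python
def multiclass_constraints(y, k):
    # the true label is preferred over every other class
    return [(y, j) for j in range(k) if j != y]

def multilabel_constraints(labels, k):
    # every relevant label is preferred over every irrelevant one
    others = [j for j in range(k) if j not in labels]
    return [(i, j) for i in labels for j in others]

def ranking_constraints(order):
    # order lists the classes best-first, e.g. [5, 4, 3, 2, 1]
    return list(zip(order[:-1], order[1:]))

print(multiclass_constraints(0, 4))          # A>B, A>C, A>D
print(multilabel_constraints({0, 1}, 4))     # A>C, A>D, B>C, B>D
print(ranking_constraints([5, 4, 3, 2, 1]))  # 5>4, 4>3, 3>2, 2>1
```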
Page 26
Learning Constraint Classification
Kesler Construction
Transform Examples
e.g. the label 2 (with k = 4) yields the constraints 2>1, 2>3, 2>4
each constraint i>j is rewritten as:
f_i(x) − f_j(x) > 0
w_i·x − w_j·x > 0
W·X_i − W·X_j > 0
W·(X_i − X_j) > 0
W·X_ij > 0
where, e.g. for i = 2, j = 4:
X_i = (0, x, 0, 0) ∈ R^{kd}
X_j = (0, 0, 0, x) ∈ R^{kd}
X_ij = X_i − X_j = (0, x, 0, −x)
W = (w_1, w_2, w_3, w_4) ∈ R^{kd}
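A minimal sketch of the expansion itself (constructing X_ij in R^{kd}):

```python
import numpy as np

def kesler_vector(x, i, j, k):
    """Constraint i>j on x in R^d -> vector in R^{kd}: x in block i, -x in block j."""
    d = len(x)
    X = np.zeros(k * d)
    X[i * d:(i + 1) * d] = x
    X[j * d:(j + 1) * d] = -x
    return X                        # W . X > 0  iff  w_i . x > w_j . x
```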
Page 27
Kesler’s Construction (1)
y = argmax_{i ∈ {r,b,g,y}} v_i·x
v_i, x ∈ R^n
Find v_r, v_b, v_g, v_y ∈ R^n such that
v_r·x > v_b·x
v_r·x > v_g·x
v_r·x > v_y·x
H = R^{kn}
Page 28
Kesler’s Construction (2)
Let v = (v_r, v_b, v_g, v_y) ∈ R^{kn}
Let 0_n be the n-dimensional zero vector
v_r·x > v_b·x  ⇔  v·(x, −x, 0_n, 0_n) > 0  ⇔  v·(−x, x, 0_n, 0_n) < 0
v_r·x > v_g·x  ⇔  v·(x, 0_n, −x, 0_n) > 0  ⇔  v·(−x, 0_n, x, 0_n) < 0
v_r·x > v_y·x  ⇔  v·(x, 0_n, 0_n, −x) > 0  ⇔  v·(−x, 0_n, 0_n, x) < 0
Page 29
Kesler’s Construction (3)
Let
v = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn}
x_ij = (0_{(i−1)n}, x, 0_{(k−i)n}) − (0_{(j−1)n}, x, 0_{(k−j)n}) ∈ R^{kn}
(i.e. x in block i and −x in block j)
Given (x, y) ∈ R^n × {1,...,k}
For all j ≠ y
Add (x_yj, +1) to P⁺(x,y)
Add (−x_yj, −1) to P⁻(x,y)
P⁺(x,y) has k−1 positive examples (⊆ R^{kn})
P⁻(x,y) has k−1 negative examples (⊆ R^{kn})
Page 30
Learning via Kesler’s Construction
Given (x_1, y_1), ..., (x_N, y_N) ∈ R^n × {1,...,k}
Create
P⁺ = ∪_i P⁺(x_i, y_i)
P⁻ = ∪_i P⁻(x_i, y_i)
Find v = (v_1, ..., v_k) ∈ R^{kn} such that
v·x separates P⁺ from P⁻
Output
f(x) = argmax_i v_i·x
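A compact sketch (not the lecture's implementation) of this procedure with a perceptron as the binary learner; the expanded constraint vectors are built inline:

```python
import numpy as np

def train_kesler_perceptron(X, y, k, epochs=10):
    """Binary perceptron in R^{kn} over the Kesler-expanded examples."""
    n = X.shape[1]
    v = np.zeros(k * n)
    for _ in range(epochs):
        for x, label in zip(X, y):
            for j in range(k):
                if j == label:
                    continue
                xij = np.zeros(k * n)          # x in block `label`, -x in block j
                xij[label * n:(label + 1) * n] = x
                xij[j * n:(j + 1) * n] = -x
                if v @ xij <= 0:               # constraint v . x_ij > 0 violated
                    v += xij
    return v.reshape(k, n)                     # rows are v_1, ..., v_k

def predict(V, x):
    return int(np.argmax(V @ x))               # f(x) = argmax_i v_i . x
```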
Page 31
Constraint Classification
Examples (x, y) with y ∈ S_k
S_k: partial order over class labels {1,...,k}
defines a “preference” relation (<) for class labels
e.g. Multiclass:
2<1, 2<3, 2<4, 2<5
e.g. Multilabel:
1<3, 1<4, 1<5, 2<3, 2<4, 2<5
Constraint Classifier
f: X → S_k
f(x) is a partial order
f(x) is consistent with y if (i<j) ∈ y ⇒ (i<j) ∈ f(x)
Page 32
Implementation
Examples (x, y) with y ∈ S_k
S_k: partial order over class labels {1,...,k}
defines a “preference” relation (>) for class labels
e.g. Multiclass:
2>1, 2>3, 2>4, 2>5
Given an example that is labeled 2, the activation of target 2 on it,
should be larger than the activation of the other targets.
SNoW implementation: conservative.
Only the target node corresponding to the correct label and the target node with the highest activation are compared.
If both are the same target node, no change.
Otherwise, promote one and demote the other.
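A small sketch of that conservative update (additive version shown for simplicity; SNoW's Winnow-based targets would use multiplicative updates instead):

```python
import numpy as np

def conservative_update(V, x, y, rate=1.0):
    """Compare only the correct target and the highest-activation target."""
    winner = int(np.argmax(V @ x))
    if winner != y:                  # same node -> no change
        V[y] += rate * x             # promote the correct target
        V[winner] -= rate * x        # demote the current winner
    return V
```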
Page 33
Properties of Construction
Can learn any argmax_i v_i·x function
Can use any algorithm to find a linear separation
Perceptron Algorithm
Winnow Algorithm
ultraconservative online algorithm [Crammer, Singer 2001]
multiclass winnow [Mesterharm 2000]
Defines a multiclass margin
via the binary margin in R^{kd}
multiclass SVM [Crammer, Singer 2001]
Page 34
Margin Generalization Bounds
Linear Hypothesis space:
h(x) = argsort_i v_i·x
v_i, x ∈ R^d
argsort returns a permutation of {1,...,k}
CC margin-based bound
γ = min_{(x,y)∈S} min_{(i<j)∈y} v_i·x − v_j·x
err_D(h) ≤ O( (C·R²/γ²) · ln(1/δ) / m )
m: number of examples
R: max_x ||x||
δ: confidence
C: average number of constraints
Page 35
VC-style Generalization Bounds
Linear Hypothesis space:
h(x) = argsort_i v_i·x
v_i, x ∈ R^d
argsort returns a permutation of {1,...,k}
CC VC-based bound
err_D(h) ≤ err(S,h) + O( (k·d·log(mk/d) + ln(1/δ)) / m )
m: number of examples
d: dimension of the input space
δ: confidence
k: number of classes
Page 36
Beyond Multiclass Classification
Ranking
x is more red than blue, but not green
Multilabel
x is both red and blue
Millions of classes
Complex relationships
category ranking (over classes)
ordinal regression (over examples)
sequence labeling (e.g. POS tagging)
LATER
SNoW has an implementation of Constraint Classification for the
Multi-Class case. Try to compare with 1-vs-all.
Experimental Issues: when is this version of multi-class better?
Several easy improvements are possible via modifying the loss
function.
Page 37
Multi-class Experiments
The picture isn’t so clear for very high-dimensional problems.
Why?
Page 38
Summary
OvA
Learning: independent: f_i(x) > 0 iff y = i
Evaluation: global: h(x) = argmax_i f_i(x)
Constraint Classification
Learning: global: find {f_i(x)} s.t. y = argmax_i f_i(x)
Evaluation: global: h(x) = argmax_i f_i(x)
Learn + Inference
Learning: independent: f_i(x) > 0 iff “i is a part of y”
Evaluation: global inference: h(x) = argmax_{y∈C(Y)} Σ f_i(x)
Inference Based Training
Learning: global: find {f_i(x)} s.t. y = argmax_{y∈C(Y)} Σ f_i(x)
Evaluation: global inference: h(x) = argmax_{y∈C(Y)} Σ f_i(x)
Page 39
Structured Output Learning
Abstract View:
Decomposition versus Constraint Classification
More details: Inference with Classifiers
Page 40
Structured Output Learning:
Semantic Role Labeling
For each verb in a sentence
1. Identify all constituents that fill a semantic role
2. Determine their roles
• Core Arguments, e.g., Agent, Patient or Instrument
• Their adjuncts, e.g., Locative, Temporal or Manner
Y: all possible ways to label the tree
C(Y): all valid ways to label the tree
Inference: argmax_{y ∈ C(Y)} g(x,y)
[A0 I] left [A1 my pearls] [A2 to my child]
(A0: leaver, A1: thing left, A2: benefactor)
Page 41
Components of Structured Output Learning
Input: X
Output: a collection of variables Y = (y_1, ..., y_L) ∈ {1,...,K}^L
The length L is example dependent
Constraints on the output, C(Y)
e.g. non-overlapping, no repeated values, ...
partition outputs into valid and invalid assignments
Representation
scoring function: g(x,y)
e.g. linear: g(x,y) = w·Φ(x,y)
Inference
h(x) = argmax_{valid y} g(x,y)
(Figure: output variables y_1, y_2, y_3 over the input “I left my pearls to my child”)
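To make these components concrete, here is a toy sketch (the sizes, feature map, and constraint are illustrative placeholders, not from the lecture) of a linear scoring function with brute-force constrained inference; real systems replace the enumeration with dynamic programming or ILP.

```python
import numpy as np
from itertools import product

K, L = 3, 4                          # K label values, output length L (toy sizes)

def phi(x, y):
    """Toy joint feature map: position-label indicator features weighted by x."""
    f = np.zeros((L, K))
    for pos, label in enumerate(y):
        f[pos, label] = x[pos]
    return f.ravel()

def g(w, x, y):
    return w @ phi(x, y)             # linear scoring function g(x,y) = w . phi(x,y)

def valid(y):
    return 0 in y                    # toy constraint, e.g. "must contain a verb"

def inference(w, x):
    """h(x) = argmax over valid outputs of g(x,y), by brute-force enumeration."""
    return max((y for y in product(range(K), repeat=L) if valid(y)),
               key=lambda y: g(w, x, y))

w = np.random.randn(L * K)
print(inference(w, np.random.rand(L)))
```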
Page 42
Decomposition-based Learning
Many choices for decomposition
Depends on problem, learning model, computation resources,
etc...
Value-based decomposition
A function for each output value: f_k(x, l), k ∈ {1,...,K}
e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ...
OvA learning: f_k(x, node) > 0 iff k = y
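A tiny sketch of such a value-based decomposition for SRL (the label set, weights, and local features below are placeholders):

```python
import numpy as np

LABELS = ["A0", "A1", "A2", "AM-LOC", "NONE"]

def local_scores(W, node_features):
    """W[k] is the weight vector of f_k(x, node); one score per label value."""
    return {k: float(W[k] @ node_features) for k in LABELS}

# stand-in for trained per-label weight vectors over 6 local features
W = {k: np.random.randn(6) for k in LABELS}
print(local_scores(W, np.ones(6)))
```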
Page 43
Learning Discriminant Functions: The
General Setting
g(x,y) > g(x,y’)   for all y’ ∈ Y \ y
w·Φ(x,y) > w·Φ(x,y’)   for all y’ ∈ Y \ y
w·Φ(x,y,y’) = w·(Φ(x,y) − Φ(x,y’)) > 0   for all y’ ∈ Y \ y
P(x,y) = {Φ(x,y,y’) : y’ ∈ Y \ y}
P(S) = {P(x,y) : (x,y) ∈ S}
Learn a unary (binary) classifier over P(S): (+P(S), −P(S))
Used in many works [C02,WW00,CS01,CM03,TGK03]
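A brief sketch of this construction (the feature map and candidate set are task-specific placeholders); note that with single-label outputs and a block feature map it reduces to Kesler's construction.

```python
import numpy as np

def difference_examples(x, y, candidates, phi):
    """Pairs (phi(x,y) - phi(x,y'), +1) and their negations for a binary learner."""
    diffs = [phi(x, y) - phi(x, yp) for yp in candidates if yp != y]
    return [(d, +1) for d in diffs] + [(-d, -1) for d in diffs]

# toy usage: outputs are single labels, phi is a one-hot block feature map
def phi(x, y, k=3):
    f = np.zeros(k * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

pairs = difference_examples(np.array([1.0, 2.0]), 0, [0, 1, 2], phi)
print(len(pairs))   # 4: two positive differences and their negations
```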
Page 44
Structured Output Learning:
Semantic Role Labeling
Learn a collection of “scoring” functions
Global score: g(x,y) = Σ_n score_{y_n}(x,y,n) = Σ_n w_{y_n}·Φ_{y_n}(x,y,n)
score_v(x,y,n) = w_v·Φ_v(x,y,n), e.g. w_A0·Φ_A0(x,y,n), w_A1·Φ_A1(x,y,n), ...
Learn locally (LO, L+I)
for each label variable (node), e.g. n = A0:
g_A0(x,y,n) = w_A0·Φ_A0(x,y,n) > 0 iff y_n = A0
Learn globally (IBT)
g(x,y) = w·Φ(x,y)
The discriminant model dictates: g(x,y) > g(x,y’) for all y’ ∈ C(Y)
Inference: argmax_{y ∈ C(Y)} g(x,y)
(Figure: per-node scores, e.g. score_A2(13) and score_NONE(3), over the parse of “I left my pearls to my child”)
Page 45
Summary
Multi-class
OvA
Learning: independent: f_i(x) > 0 iff y = i
Evaluation: global: h(x) = argmax_i f_i(x)
Constraint Classification
Learning: global: find {f_i(x)} s.t. y = argmax_i f_i(x)
Evaluation: global: h(x) = argmax_i f_i(x)
Structured Output
Learn + Inference
Learning: independent: f_i(x) > 0 iff “i is a part of y”
Evaluation: global inference: h(x) = Inference{f_i(x)}
Efficient learning
Inference Based Training
Learning: global: find {f_i(x)} s.t. Y = Inference{f_i(x)}
Evaluation: global inference: h(x) = Inference{f_i(x)}
Less efficient learning
Page 46