SRL via Generalized Inference


Multiclass Classification in NLP

Named Entity Recognition

Label people, locations, and organizations in a sentence
[PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress].

Decompose into sub-problems (one sub-problem per candidate phrase):

Sam Houston, born in Virginia...  →  (PER, LOC, ORG, ?)  →  PER  (1)
Sam Houston, born in Virginia...  →  (PER, LOC, ORG, ?)  →  None (0)
Sam Houston, born in Virginia...  →  (PER, LOC, ORG, ?)  →  LOC  (2)
Many problems in NLP are decomposed this way
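As a rough illustration, here is a minimal Python sketch of that decomposition; the candidate phrases and the tiny lexicon "classifier" are made up for illustration only.

    # Decompose NER into independent per-phrase classification sub-problems.
    LABELS = ["None", "PER", "LOC", "ORG"]                   # class ids 0..3

    def classify_phrase(phrase, sentence):
        # Stub classifier: a real system would score features of the phrase in context.
        lexicon = {"Sam Houston": "PER", "Virginia": "LOC", "US Congress": "ORG"}
        return lexicon.get(phrase, "None")

    def decompose_and_label(sentence, candidate_phrases):
        # One independent sub-problem per candidate phrase.
        return [(p, classify_phrase(p, sentence)) for p in candidate_phrases]

    sentence = "Sam Houston, born in Virginia, was a member of the US Congress."
    print(decompose_and_label(sentence, ["Sam Houston", "born in", "Virginia", "US Congress"]))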

Disambiguation tasks:

- POS Tagging
- Word-sense disambiguation
- Verb Classification
- Semantic-Role Labeling
Page 1
Outline

- Multi-Categorical Classification Tasks
- Decomposition Approaches
- Constraint Classification
  - example: Semantic Role Labeling (SRL)
  - Unifies learning of multi-categorical classifiers
- Structured-Output Learning
  - revisit SRL
- Decomposition versus Constraint Classification

Goal:
- Discuss multi-class and structured output from the same perspective.
- Discuss similarities and differences.
Page 2
Multi-Categorical Output Tasks

Multi-class Classification (y ∈ {1,...,K})
- character recognition (‘6’)
- document classification (‘homepage’)

Multi-label Classification (y ⊆ {1,...,K})
- document classification (‘(homepage, facultypage)’)

Category Ranking (y = a ranking of {1,...,K})
- user preference (‘love > like > hate’)
- document classification (‘homepage > facultypage > sports’)

Hierarchical Classification (y ∈ {1,...,K})
- must cohere with the class hierarchy
- place a document into an index where ‘soccer’ is-a ‘sport’
Page 3
(more) Multi-Categorical Output Tasks

Sequential Prediction (y ∈ {1,...,K}⁺)
- e.g. POS tagging (‘(N V N N A)’)
  “This is a sentence.”  →  D V N D
- e.g. phrase identification
- Many labels: K^L for a length-L sentence

Structured Output Prediction (y ∈ C({1,...,K}⁺))
- e.g. parse tree, multi-level phrase identification
- e.g. sequential prediction
- Constrained by domain, problem, data, background knowledge, etc.
Page 4
Semantic Role Labeling
A Structured-Output Problem

For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
   - Core arguments, e.g., Agent, Patient or Instrument
   - Their adjuncts, e.g., Locative, Temporal or Manner

I left my pearls to my daughter-in-law in my will.
A0 (leaver): I    A1 (thing left): my pearls
A2 (benefactor): to my daughter-in-law    AM-LOC: in my will
Page 5
Semantic Role Labeling


Candidate role labels for each constituent: A0, A1, A2, AM-LOC, ...

I left my pearls to my daughter-in-law in my will.

- Many possible valid outputs
- Many possible invalid outputs
Page 6
Structured Output Problems

Multi-Class
- View y = 4 as (y1,...,yk) = (0 0 0 1 0 0 0)
- The output is restricted by “exactly one yi = 1”
- Learn f1(x),...,fk(x)

Sequence Prediction
- e.g. POS tagging: x = (My name is Dav), y = (Pr, N, V, N)
- e.g. restriction: “Every sentence must have a verb”

Structured Output
- Arbitrary global constraints
- Local functions do not have access to the global constraints!

Goal:
- Discuss multi-class and structured output from the same perspective.
- Discuss similarities and differences.
Page 7
Transform the sub-problems

Sam Houston, born in Virginia...  →  (PER, LOC, ORG, ?)  →  PER (1)

Transform each problem to a feature vector:

  Sam Houston, born in Virginia
  →  features: (Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ...)
  →  values:   (  0 ,   0  ,      1     ,   0  ,   1  ,    1  , ...)

Transform each label to a class label:

  PER → 1    LOC → 2    ORG → 3    ? → 0

Input:  {0,1}^d or R^d
Output: {0,1,2,3,...,k}
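A minimal sketch of this transformation (Python; the feature names are the slide's illustrative ones, and the helper function is hypothetical):

    # Sketch: a candidate phrase in context becomes a sparse 0/1 feature vector,
    # and its label becomes an integer class id.
    FEATURES = ["Bob-", "JOHN-", "SAM HOUSTON", "HAPPY", "-BORN", "--BORN"]
    LABEL_TO_ID = {"?": 0, "PER": 1, "LOC": 2, "ORG": 3}

    def to_feature_vector(active_features):
        # d-dimensional {0,1} vector marking which features fire on this phrase
        return [1 if f in active_features else 0 for f in FEATURES]

    x = to_feature_vector({"SAM HOUSTON", "-BORN", "--BORN"})
    y = LABEL_TO_ID["PER"]
    print(x, y)   # [0, 0, 1, 0, 1, 1] 1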
Page 8
Solving multiclass with binary learning

Multiclass classifier
- a function f : R^d → {1,2,3,...,k}

Decompose into binary problems?
- Not always possible to learn
- No theoretical justification (unless the problem is easy)
Page 9
The Real MultiClass Problem

General framework
- Extends binary algorithms
- Theoretically justified
  - Provably correct
  - Generalizes well
  - Verified experimentally
- Naturally extends binary classification algorithms to the multiclass setting
  - e.g. linear binary separation induces linear boundaries in the multiclass setting
Page 10
Multi-Class over Linear Functions
– One versus all (OvA)
– All versus all (AvA)
– Direct winner-take-all (D-WTA)
Page 11
WTA over linear functions

Assume examples are generated from a winner-take-all (WTA) hypothesis:

  y = argmax_i  w_i · x + t_i        w_i, x ∈ R^n,  t_i ∈ R

Note: Voronoi diagrams are WTA functions:

  argmin_i ||c_i − x||  =  argmax_i  c_i · x − ||c_i||² / 2
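A quick numeric check of that identity (a Python/NumPy sketch with random data):

    # Nearest-center (Voronoi) classification equals a winner-take-all linear rule
    # with w_i = c_i and t_i = -||c_i||^2 / 2.
    import numpy as np

    rng = np.random.default_rng(0)
    centers = rng.normal(size=(4, 5))          # 4 classes, 5-dimensional inputs
    X = rng.normal(size=(100, 5))

    voronoi = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    wta = np.argmax(X @ centers.T - 0.5 * (centers ** 2).sum(-1), axis=1)
    assert (voronoi == wta).all()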
Page 12
Learning via One-Versus-All (OvA)
Assumption: find v_r, v_b, v_g, v_y ∈ R^n such that
- v_r · x > 0  iff  y = red
- v_b · x > 0  iff  y = blue
- v_g · x > 0  iff  y = green
- v_y · x > 0  iff  y = yellow

Classifier: f(x) = argmax_i  v_i · x

Hypothesis space: H = R^{kn}

[Figure: the individual classifiers and the resulting decision regions]
A small training sketch follows.
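A minimal OvA sketch (Python/NumPy; perceptron updates are used here as the binary learner, but any binary algorithm could be plugged in):

    # One-versus-all: k independent binary classifiers, prediction by argmax score.
    import numpy as np

    def train_ova(X, y, k, epochs=10):
        V = np.zeros((k, X.shape[1]))              # one weight vector per class
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                for c in range(k):
                    target = 1 if yi == c else -1  # class c versus the rest
                    if target * (V[c] @ xi) <= 0:  # perceptron mistake
                        V[c] += target * xi
        return V

    def predict_ova(V, X):
        return np.argmax(X @ V.T, axis=1)          # winner-take-all over the k scores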
Page 13
Learning via All-versus-All (AvA)

Assumption: find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that
- v_rb · x > 0 if y = red,   < 0 if y = blue
- v_rg · x > 0 if y = red,   < 0 if y = green
- ... (for all pairs)

Hypothesis space: H = R^{k²n}

How to classify?

[Figure: the individual pairwise classifiers and the resulting decision regions]
Page 14
Classifying with AvA
Options for combining the pairwise classifiers:
- Tree
- Majority vote (e.g. 1 red, 2 yellow, 2 green: which class wins?)
- Tournament

All are applied after learning and can behave inconsistently; a majority-vote sketch follows.
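A minimal AvA sketch (Python/NumPy; again with perceptron updates as the pairwise binary learner, and majority vote at prediction time):

    # All-versus-all: one binary classifier per class pair, majority vote to predict.
    import numpy as np
    from itertools import combinations

    def train_ava(X, y, k, epochs=10):
        V = {}                                          # (i, j) -> weight vector
        for i, j in combinations(range(k), 2):
            v = np.zeros(X.shape[1])
            for _ in range(epochs):
                for xi, yi in zip(X, y):
                    if yi not in (i, j):
                        continue                        # only examples of this pair
                    target = 1 if yi == i else -1
                    if target * (v @ xi) <= 0:
                        v += target * xi
            V[(i, j)] = v
        return V

    def predict_ava(V, x, k):
        votes = np.zeros(k)
        for (i, j), v in V.items():
            votes[i if v @ x > 0 else j] += 1           # each pair casts one vote
        return int(np.argmax(votes))                    # ties broken arbitrarily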
Page 15
Summary (1): Learning Binary Classifiers

On-line: Perceptron, Winnow
- Mistake bounded
- Generalizes well (VC-dimension)
- Works well in practice

SVM
- Well motivated: maximizes the margin
- Generalizes well
- Works well in practice

Boosting, Neural Networks, etc.
Page 16
From Binary to Multi-categorical

Decompose multi-categorical problems into multiple (independent) binary problems
- Multi-class: OvA, AvA, ECOC, decision trees, etc.
- Multi-label: reduce to multi-class
- Category ranking: reduce to multi-class, or use regression
- Sequence prediction:
  - reduce to multi-class
  - part/alphabet-based decompositions
- Structured output:
  - learn parts of the output based on local information!
Page 17
Problems with Decompositions

Learning optimizes over local metrics
- What is the right metric? We don’t care about the performance of the local classifiers
- Result: poor global performance

Poor decomposition → poor performance
- Difficult local problems
- Irrelevant local problems

Not clear how to decompose all multi-category problems
Page 18
Multi-class OvA Decomposition: a Linear Representation

Hypothesis: h(x) = argmax_i  v_i · x

Decomposition
- Each class i is represented by a linear function v_i · x

Learning: One-versus-all (OvA)
- For each class i:  v_i · x > 0  iff  i = y

General case
- Each class is represented by a function f_i(x) > 0
Page 19
Learning via One-Versus-All (OvA)
Assumption / classifier: f(x) = argmax_i  v_i · x

[Figure: the individual classifiers]

- OvA learning: find v_i such that  v_i · x > 0  iff  y = i
- OvA is fine only if the data is OvA separable!
- Yet a linear classifier can represent the WTA function itself:
  (Voronoi)  argmin_i d(c_i, x)   ↔   (WTA)  argmax_i  c_i · x + d_i
Page 20
Other Issues we Mentioned

Error Correcting Output Codes (ECOC)
- Another (class of) decomposition
- Difficulty: how to make sure the resulting binary problems are separable

Also commented on the advantage of All-versus-All when working in the dual space (e.g., with kernels)
Page 21
Example: SNoW Multi-class Classifier
How do we train? How do we evaluate?

[Figure: target nodes (each an LTU), weighted edges (weight vectors), and feature nodes]

SNoW only represents the targets and the weighted edges
Page 22
Winnow: Extensions


Winnow learns monotone Boolean functions.

To learn non-monotone Boolean functions:
- for each variable x, introduce x’ = ¬x
- learn monotone functions over 2n variables

To learn functions with real-valued inputs: “Balanced Winnow”
- 2 weights per variable; the effective weight is their difference
- Update rule: if [(w⁺ − w⁻) · x ≥ θ] ≠ y, then for each i:
    w⁺_i ← w⁺_i · r^(y·x_i),    w⁻_i ← w⁻_i · r^(−y·x_i)
Page 23
An Intuition: Balanced Winnow



- In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples.
- Typically, we train each node separately (my / not-my example).
- Rather, given an example we could say: this is more of a + example than a − example.

  If [(w⁺ − w⁻) · x ≥ θ] ≠ y:    w⁺_i ← w⁺_i · r^(y·x_i),    w⁻_i ← w⁻_i · r^(−y·x_i)

- We compare the activations of the different target nodes (classifiers) on a given example. (This example is more class + than class −.) A sketch of this update appears below.
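A minimal Balanced Winnow sketch (Python/NumPy), assuming labels y ∈ {+1, −1} and non-negative feature values; the promotion rate r and threshold θ are illustrative choices:

    # Two positive weight vectors per feature; the effective weight is w_pos - w_neg.
    import numpy as np

    def balanced_winnow(X, y, r=1.5, theta=1.0, epochs=10):
        n = X.shape[1]
        w_pos, w_neg = np.ones(n), np.ones(n)
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = 1 if (w_pos - w_neg) @ xi >= theta else -1
                if pred != yi:                    # mistake-driven multiplicative update
                    w_pos *= r ** (yi * xi)       # promote / demote the + weights
                    w_neg *= r ** (-yi * xi)      # and do the opposite to the - weights
        return w_pos, w_neg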
Page 24
Constraint Classification







- Can be viewed as a generalization of Balanced Winnow to the multi-class case
- Unifies multi-class, multi-label, and category ranking
- Reduces learning to a single binary learning task
- Captures the theoretical properties of the binary algorithm
- Experimentally verified
- Naturally extends Perceptron, SVM, etc.
- Does all of this by representing labels as a set of constraints, or preferences, among output labels.
Page 25
Multi-category to Constraint Classification

- Multiclass:      (x, A)            →  (x, (A>B, A>C, A>D))
- Multilabel:      (x, (A, B))       →  (x, (A>C, A>D, B>C, B>D))
- Label ranking:   (x, (5>4>3>2>1))  →  (x, (5>4, 4>3, 3>2, 2>1))

Examples (x, y), with y ∈ S_k
- S_k : partial orders over the class labels {1,...,k}
- y defines a “preference” relation ( > ) for class labeling

Constraint classifier: h : X → S_k
Page 26
Learning Constraint Classification
Kesler Construction

Transform examples: each preference i > j becomes one binary example.

  i > j   (e.g. 2>1, 2>3, 2>4)
  f_i(x) − f_j(x) > 0
  w_i · x − w_j · x > 0
  W · X_i − W · X_j > 0
  W · (X_i − X_j) > 0
  W · X_ij > 0

where, for k = 4 classes (here i = 2, j = 4),
  X_i  = (0, x, 0, 0) ∈ R^{kd}      (x placed in block i)
  X_j  = (0, 0, 0, x) ∈ R^{kd}      (x placed in block j)
  X_ij = X_i − X_j = (0, x, 0, −x)
  W    = (w_1, w_2, w_3, w_4) ∈ R^{kd}
Page 27
Kesler’s Construction (1)

y = argmax_{i ∈ {r,b,g,y}}  v_i · x,        v_i, x ∈ R^n

Find v_r, v_b, v_g, v_y ∈ R^n such that
- v_r · x > v_b · x
- v_r · x > v_g · x
- v_r · x > v_y · x

H = R^{kn}
Page 28
Kesler’s Construction (2)


Let v = (v_r, v_b, v_g, v_y) ∈ R^{kn}, and let 0^n be the n-dimensional zero vector.

  v_r · x > v_b · x   ⇔   v · (x, −x, 0^n, 0^n) > 0   ⇔   v · (−x, x, 0^n, 0^n) < 0
  v_r · x > v_g · x   ⇔   v · (x, 0^n, −x, 0^n) > 0   ⇔   v · (−x, 0^n, x, 0^n) < 0
  v_r · x > v_y · x   ⇔   v · (x, 0^n, 0^n, −x) > 0   ⇔   v · (−x, 0^n, 0^n, x) < 0
Page 29
Kesler’s Construction (3)

Let
  v    = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn}
  x^ij = (0^{(i-1)n}, x, 0^{(k-i)n}) − (0^{(j-1)n}, x, 0^{(k-j)n}) ∈ R^{kn}
         (i.e. x in block i, −x in block j, zeros elsewhere)

Given (x, y) ∈ R^n × {1,...,k}, for all j ≠ y:
- add (x^{yj}, +1) to P⁺(x,y)
- add (−x^{yj}, −1) to P⁻(x,y)

P⁺(x,y) has k−1 positive examples (⊂ R^{kn})
P⁻(x,y) has k−1 negative examples (⊂ R^{kn})
Page 30
Learning via Kesler’s Construction


Given (x¹, y¹), ..., (xᴺ, yᴺ) ∈ R^n × {1,...,k}

Create
  P⁺ = ∪_i P⁺(xⁱ, yⁱ)
  P⁻ = ∪_i P⁻(xⁱ, yⁱ)

Find v = (v_1, ..., v_k) ∈ R^{kn} such that v · x separates P⁺ from P⁻

Output: f(x) = argmax_i  v_i · x
A small end-to-end sketch follows.
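A minimal end-to-end sketch of learning via Kesler's construction (Python/NumPy; a plain binary perceptron is used as the separator, as one possible choice):

    # Expand each multiclass example into k-1 binary examples in R^{kn}, run a
    # binary perceptron on them, then predict with argmax over the per-class blocks.
    import numpy as np

    def kesler_expand(x, y, k):
        n = len(x)
        out = []
        for j in range(k):
            if j == y:
                continue
            z = np.zeros(k * n)
            z[y * n:(y + 1) * n] = x       # +x in the block of the true class
            z[j * n:(j + 1) * n] = -x      # -x in the block of the competing class
            out.append(z)                  # a correct v satisfies v . z > 0
        return out

    def train_kesler(X, y, k, epochs=10):
        v = np.zeros(k * X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                for z in kesler_expand(xi, yi, k):
                    if v @ z <= 0:         # mistake on a positive constraint example
                        v += z
        return v.reshape(k, X.shape[1])    # rows are v_1, ..., v_k

    def predict(V, X):
        return np.argmax(X @ V.T, axis=1)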
Page 31
Constraint Classification

Examples (x, y), with y ∈ S_k
- S_k : partial orders over the class labels {1,...,k}
- y defines a “preference” relation ( > ) for class labels
  - e.g. multiclass (label 2):        2>1, 2>3, 2>4, 2>5
  - e.g. multilabel (labels 1 and 2): 1>3, 1>4, 1>5, 2>3, 2>4, 2>5

Constraint classifier
- f : X → S_k
- f(x) is a partial order
- f(x) is consistent with y if (i>j) ∈ y ⇒ (i>j) ∈ f(x)
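A tiny sketch of that consistency test (Python; a label y and a prediction f(x) are each represented as a set of (i, j) pairs meaning “i is preferred to j”):

    # A partial order as a set of (i, j) pairs, "i preferred to j".
    def consistent(y_constraints, fx_constraints):
        # f(x) is consistent with y if every preference in y also holds in f(x)
        return y_constraints <= fx_constraints        # subset test

    y_label_2 = {(2, 1), (2, 3), (2, 4), (2, 5)}      # multiclass example labeled 2
    fx = {(2, 1), (2, 3), (2, 4), (2, 5), (1, 3)}
    print(consistent(y_label_2, fx))                  # True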

Page 32
Implementation

Examples (x, y), with y ∈ S_k
- S_k : partial orders over the class labels {1,...,k}
- y defines a “preference” relation ( > ) for class labels
  - e.g. multiclass (label 2): 2>1, 2>3, 2>4, 2>5
- Given an example labeled 2, the activation of target 2 on it should be larger than the activations of the other targets.

SNoW implementation: conservative.
- Only the target node corresponding to the correct label and the target node with the highest activation are compared.
- If both are the same target node, no change.
- Otherwise, promote one and demote the other (a sketch of this rule follows).
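A sketch of the shape of that conservative update (Python/NumPy; additive perceptron-style updates are used here for simplicity, whereas SNoW's targets are Winnow-style multiplicative learners):

    # Conservative multiclass update: compare only the correct target and the winner.
    import numpy as np

    def conservative_update(V, x, y, lr=1.0):
        # V: (k, d) weight matrix, one row (target node) per class label.
        winner = int(np.argmax(V @ x))     # target node with the highest activation
        if winner != y:                    # only then is anything changed
            V[y] += lr * x                 # promote the correct target
            V[winner] -= lr * x            # demote the current winner
        return V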
Page 33
Properties of Construction


- Can learn any argmax_i v_i · x function
- Can use any algorithm that finds a linear separation
  - Perceptron algorithm
    - ultraconservative online algorithm [Crammer, Singer 2001]
  - Winnow algorithm
    - multiclass Winnow [Mesterharm 2000]
- Defines a multiclass margin
  - via the binary margin in R^{kd}
  - multiclass SVM [Crammer, Singer 2001]
Page 34

Margin Generalization Bounds

Linear hypothesis space:
- h(x) = argsort_i  v_i · x,    with v_i, x ∈ R^d
- argsort returns a permutation of {1,...,k}

CC margin-based bound, with margin
  γ = min_{(x,y) ∈ S}  min_{(i>j) ∈ y}  ( v_i · x − v_j · x )
C R 2

errD (h)    2  ln( )

m 




m - number of examples
R - maxx ||x||
 - confidence
C - average # constraints
Page 35
VC-style Generalization Bounds

Linear hypothesis space:
- h(x) = argsort_i  v_i · x,    with v_i, x ∈ R^d
- argsort returns a permutation of {1,...,k}
CC VC-based bound

  err_D(h)  ≤  err(S,h)  +  √( ( kd · log(mk/d)  −  ln δ ) / m )

- m : number of examples
- d : dimension of the input space
- δ : confidence
- k : number of classes
Page 36
Beyond Multiclass Classification

Ranking
- category ranking (over classes)
- ordinal regression (over examples)
- e.g. x is more red than blue, but not green

Multilabel
- e.g. x is both red and blue

Complex relationships
- sequence labeling (e.g. POS tagging): millions of classes
- LATER

SNoW has an implementation of Constraint Classification for the multi-class case; try to compare it with 1-vs-all.
Experimental issues: when is this version of multi-class better?
Several easy improvements are possible via modifying the loss function.
Page 37
Multi-class Experiments
The picture isn’t so clear for very high-dimensional problems. Why?
Page 38
Summary
OvA
- Learning: independent; f_i(x) > 0 iff y = i
- Evaluation: global; h(x) = argmax_i f_i(x)

Constraint Classification
- Learning: global; find {f_i(x)} s.t. y = argmax_i f_i(x)
- Evaluation: global; h(x) = argmax_i f_i(x)

Learn + Inference (L+I)
- Learning: independent; f_i(x) > 0 iff “i is a part of y”
- Evaluation: global inference; h(x) = argmax_{y ∈ C} Σ_i f_i(x)

Inference Based Training (IBT)
- Learning: global; find {f_i(x)} s.t. y = argmax f_i(x)
- Evaluation: global; h(x) = argmax f_i(x)
Page 39
Structured Output Learning

Abstract view:
- Decomposition versus Constraint Classification
- More details: Inference with Classifiers
Page 40
Structured Output Learning:
Semantic Role Labeling

For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
   - Core arguments, e.g., Agent, Patient or Instrument
   - Their adjuncts, e.g., Locative, Temporal or Manner

Y    : all possible ways to label the tree
C(Y) : all valid ways to label the tree

  argmax_{y ∈ C(Y)}  g(x,y)

I left my pearls to my child
A0 (leaver): I    A1 (thing left): my pearls    A2 (benefactor): to my child
Page 41
Components of Structured Output Learning


Input: X
Output: a collection of variables Y = (y_1, ..., y_L) ∈ {1,...,K}^L
- the length L is example dependent
- e.g. one variable per constituent of “I left my pearls to my child” (y_1, y_2, y_3, ...)

Constraints on the output, C(Y)
- e.g. non-overlapping arguments, no repeated values, ...
- partition the outputs into valid and invalid assignments

Representation
- scoring function g(x,y)
- e.g. linear: g(x,y) = w · Φ(x,y)

Inference
- h(x) = argmax_{valid y} g(x,y)    (a brute-force sketch follows)
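A brute-force sketch of that inference step (Python/NumPy; phi and valid are problem-specific and assumed given, and real systems replace the enumeration with search, dynamic programming, or integer linear programming):

    # Score every assignment with g(x,y) = w . phi(x,y); keep the best valid one.
    import itertools
    import numpy as np

    def inference(x, w, phi, labels, length, valid):
        best_y, best_score = None, -np.inf
        for y in itertools.product(labels, repeat=length):   # all K^L assignments
            if not valid(y):                                 # keep only y in C(Y)
                continue
            score = w @ phi(x, y)                            # linear scoring function
            if score > best_score:
                best_y, best_score = y, score
        return best_y

    # Tiny illustrative use with made-up pieces:
    labels = ["A0", "A1", "NONE"]
    w = np.array([1.0, 0.5, -0.2])
    phi = lambda x, y: np.array([y.count("A0"), y.count("A1"), y.count("NONE")], float)
    valid = lambda y: y.count("A0") <= 1 and y.count("A1") <= 1   # no repeated core argument
    print(inference(None, w, phi, labels, length=3, valid=valid))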
Page 42
Decomposition-based Learning

Many choices for decomposition
- depends on the problem, the learning model, computational resources, etc.

Value-based decomposition
- one function for each output value: f_k(x, l), k ∈ {1,...,K}
- e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ...

OvA learning
- f_k(x, node) > 0  iff  k is the label of that node
Page 43
Learning Discriminant Functions: The General Setting

  g(x,y) > g(x,y′)                                      for all y′ ∈ Y \ y
  w · Φ(x,y) > w · Φ(x,y′)                              for all y′ ∈ Y \ y
  w · Φ(x,y,y′) ≡ w · (Φ(x,y) − Φ(x,y′)) > 0            for all y′ ∈ Y \ y

  P(x,y) = { Φ(x,y,y′) : y′ ∈ Y \ y }
  P(S)   = { P(x,y) : (x,y) ∈ S }

Learn a unary (binary) classifier over P(S):  (+P(S), −P(S))

Used in many works [C02, WW00, CS01, CM03, TGK03]
Page 44
Structured Output Learning:
Semantic Role Labeling

Learn a collection of “scoring” functions
- w_A0 · Φ_A0(x,y,n),  w_A1 · Φ_A1(x,y,n), ...
- score_v(x,y,n) = w_v · Φ_v(x,y,n)

Global score
  g(x,y) = Σ_n score_{y_n}(x,y,n) = Σ_n w_{y_n} · Φ_{y_n}(x,y,n)

Learn locally (LO, L+I)
- for each label variable (node) n, e.g. n = A0:
  g_A0(x,y,n) = w_A0 · Φ_A0(x,y,n) > 0  iff  y_n = A0

Learn globally (IBT)
- g(x,y) = w · Φ(x,y)
- the discriminant model dictates:  g(x,y) > g(x,y′)  for y ∈ C(Y)
- prediction: argmax_{y ∈ C(Y)} g(x,y)

(Example sentence: “I left my pearls to my child”, with per-node scores such as score_A2(node 13), score_NONE(node 3).)
Page 45
Summary

Multi-class
  OvA
  - Learning: independent; f_i(x) > 0 iff y = i
  - Evaluation: global; h(x) = argmax_i f_i(x)
  Constraint Classification
  - Learning: global; find {f_i(x)} s.t. y = argmax_i f_i(x)
  - Evaluation: global; h(x) = argmax_i f_i(x)

Structured Output
  Learn + Inference (L+I)
  - Learning: independent; f_i(x) > 0 iff “i is a part of y”
  - Evaluation: global inference; h(x) = Inference{f_i(x)}
  - Efficient learning
  Inference Based Training (IBT)
  - Learning: global; find {f_i(x)} s.t. y = Inference{f_i(x)}
  - Evaluation: global inference; h(x) = Inference{f_i(x)}
  - Less efficient learning
Page 46