
STRUCTURED
PERCEPTRON
Alice Lai and Shi Zhi
Presentation Outline
• Introduction to Structured Perceptron
• ILP-CRF Model
• Averaged Perceptron
• Latent Variable Perceptron
Motivation
• An algorithm to learn weights for structured prediction
• An alternative to MEMM and CRF approaches for POS tagging (Collins
2002)
• Convergence guarantees under certain conditions even
for inseparable data
• Generalizes to new examples and other sequence
labeling problems
POS Tagging Example
Example sentence: "the man saw the dog"
[Figure: tagging lattice with candidate tags D, N, A, V for each of the five words]
Gold labels: the/D man/N saw/V the/D dog/N
Prediction: the/D man/N saw/N the/D dog/N
Parameter update (sketched in code below):
Add 1: $\alpha_{D,N,V}$, $\alpha_{N,V,D}$, $\alpha_{V,D,N}$, $\alpha_{V,saw}$
Subtract 1: $\alpha_{D,N,N}$, $\alpha_{N,N,D}$, $\alpha_{N,D,N}$, $\alpha_{N,saw}$
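As a concrete check of this update, here is a minimal Python sketch (illustrative code, not from the paper) that builds global feature counts from tag-trigram and tag/word templates and reproduces the Add 1 / Subtract 1 lists above.

```python
from collections import Counter

def features(words, tags):
    """Global feature counts for a (words, tags) pair:
    tag-trigram features and tag/word features, as in the example above."""
    padded = ["*", "*"] + list(tags)      # boundary symbols for the first trigrams
    feats = Counter()
    for i, (word, tag) in enumerate(zip(words, tags)):
        feats[("trigram", padded[i], padded[i + 1], tag)] += 1
        feats[("tag_word", tag, word)] += 1
    return feats

# The example from this slide:
words = ["the", "man", "saw", "the", "dog"]
gold  = ["D", "N", "V", "D", "N"]
pred  = ["D", "N", "N", "D", "N"]

alpha = Counter()
alpha.update(features(words, gold))      # Add 1 to every gold feature
alpha.subtract(features(words, pred))    # Subtract 1 from every predicted feature

# Features shared by gold and prediction cancel; the nonzero entries left in alpha
# are exactly the Add 1 / Subtract 1 updates listed above.
print({f: v for f, v in alpha.items() if v != 0})
```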
MEMM Approach
• Conditional model: probability of the current state given
previous state and current observation
• For tagging problem, define local features for each tag in
context
• Features are often indicator functions
• Learn parameter vector α with Generalized Iterative
Scaling or gradient descent
Global Features
• Local features are defined only for a single label
• Global features are defined for an observed sequence
and a possible label sequence
• Simple version: global features are local features summed
over an observation-label sequence pair (see the sketch after this list)
• Compared to the original perceptron algorithm, we predict a
vector of labels instead of a single label
• Which of the possible incorrect label vectors do we use as the
negative example in training?
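A minimal sketch of that "simple version" (the feature templates are my own illustrative choices): the global feature vector Φ(x, y) is just the per-position local features summed over the sequence.

```python
from collections import Counter

def local_features(words, i, prev_tag, tag):
    """Local indicator features for a single position (illustrative templates)."""
    return Counter({
        ("tag_bigram", prev_tag, tag): 1,
        ("tag_word", tag, words[i]): 1,
    })

def global_features(words, tags):
    """Simple version of global features: sum the local features over the sequence."""
    phi = Counter()
    prev = "*"                       # boundary symbol before the first tag
    for i, tag in enumerate(tags):
        phi.update(local_features(words, i, prev, tag))
        prev = tag
    return phi

print(global_features(["the", "man", "saw", "the", "dog"],
                      ["D", "N", "V", "D", "N"]))
```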
Structured Perceptron Algorithm
Input: training examples $(x_i, y_i)$
Initialize parameter vector $\alpha = 0$
For t = 1…max_iter:
  For i = 1…n:
    $y^* = \arg\max_{y \in \mathrm{GEN}(x_i)} \Phi(x_i, y) \cdot \alpha$
    If $y^* \neq y_i$ then update: $\alpha = \alpha + \Phi(x_i, y_i) - \Phi(x_i, y^*)$
Output: parameter vector $\alpha$
GEN($x_i$) enumerates the possible label sequences $y$ for
observed sequence $x_i$.
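Putting this together, the following is a runnable sketch of the training loop under toy assumptions: my own feature templates, and GEN implemented as brute-force enumeration over a small tag set (in practice GEN is searched with Viterbi decoding for tagging).

```python
import itertools
from collections import Counter

TAGS = ["D", "N", "A", "V"]

def phi(words, tags):
    """Global features: tag-bigram and tag/word indicators summed over the sequence."""
    feats = Counter()
    prev = "*"
    for word, tag in zip(words, tags):
        feats[("bigram", prev, tag)] += 1
        feats[("tag_word", tag, word)] += 1
        prev = tag
    return feats

def score(alpha, feats):
    return sum(alpha[f] * v for f, v in feats.items())

def gen(words):
    """Enumerate all candidate label sequences (exponential; use Viterbi in practice)."""
    return itertools.product(TAGS, repeat=len(words))

def train(examples, max_iter=10):
    alpha = Counter()
    for _ in range(max_iter):
        for words, gold in examples:
            y_star = max(gen(words), key=lambda y: score(alpha, phi(words, y)))
            if list(y_star) != list(gold):
                alpha.update(phi(words, gold))       # add gold features
                alpha.subtract(phi(words, y_star))   # subtract predicted features
    return alpha

alpha = train([(["the", "man", "saw", "the", "dog"], ["D", "N", "V", "D", "N"])])
```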
Properties
• Convergence
  • Data $\{(x_i, y_i)\}$ is separable with margin $\delta > 0$ if there is some vector $U$ with $\|U\| = 1$ such that $\forall i, \forall y \in \mathrm{GEN}(x_i) - \{y_i\}$: $U \cdot \Phi(x_i, y_i) - U \cdot \Phi(x_i, y) \geq \delta$
  • For data separable with margin $\delta$, the number of mistakes made in training is bounded by $R^2 / \delta^2$, where $R$ is a constant such that $\forall i, \forall y \in \mathrm{GEN}(x_i) - \{y_i\}$: $\|\Phi(x_i, y_i) - \Phi(x_i, y)\| \leq R$
• Inseparable case
  • Number of mistakes $\leq \min_{U,\delta} \frac{(R + D_{U,\delta})^2}{\delta^2}$, where $D_{U,\delta}$ measures how far the data is from being separable with margin $\delta$ under $U$
• Generalization
Theorems and proofs from Collins 2002
Global vs. Local Learning
• Global learning (IBT): constraints are used during training
• Local learning (L+I): classifiers are trained without
constraints, constraints are applied later to produce global
output
• Example: ILP-CRF model [Roth and Yih 2005]
Perceptron IBT
• This is structured perceptron!
Input: training examples $(x_i, y_i)$
Initialize parameter vector $\alpha = 0$
For t = 1…max_iter:
  For i = 1…n:
    $y^* = \arg\max_{y \in \mathrm{GEN}(x_i)} \Phi(x_i, y) \cdot \alpha$
    If $y^* \neq y_i$ then update: $\alpha = \alpha + \Phi(x_i, y_i) - \Phi(x_i, y^*)$
Output: parameter vector $\alpha$
GEN($x_i$) enumerates the possible label sequences for observed sequence $x_i$.
$\Phi(x_i, y) \cdot \alpha$ is the scoring function, as above.
Perceptron L+I
• Decomposition: $y^* = \arg\max_{y \in \mathrm{GEN}(x)} \; \alpha \cdot \phi(x, y) + \rho \cdot \Phi(x, y)$, where $\phi$ are local features and $\Phi$ are global features
• Prediction: $y^* = \arg\max_{y \in \mathrm{GEN}(x_i)} \phi(x_i, y) \cdot \alpha$
• If $y^* \neq y_i$ then update: $\alpha = \alpha + \phi(x_i, y_i) - \phi(x_i, y^*)$
• Either learn a parameter vector $\rho$ for the global features $\Phi$, or perform constrained inference only at evaluation time (the sketch below takes the second option)
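A minimal sketch of the L+I recipe under toy assumptions (illustrative per-position features; brute-force constrained search instead of ILP): the classifier is trained without constraints, and a constraint is enforced only when producing the global output at evaluation time.

```python
import itertools
from collections import Counter

TAGS = ["D", "N", "A", "V"]

def local_phi(words, i, tag):
    """Local features for one position (illustrative template, no neighboring tags)."""
    return Counter({("tag_word", tag, words[i]): 1})

def train_local(examples, max_iter=10):
    """L+I training: per-position perceptron updates, no constraints used."""
    alpha = Counter()
    for _ in range(max_iter):
        for words, gold in examples:
            for i, gold_tag in enumerate(gold):
                pred = max(TAGS, key=lambda t: sum(alpha[f] for f in local_phi(words, i, t)))
                if pred != gold_tag:
                    alpha.update(local_phi(words, i, gold_tag))
                    alpha.subtract(local_phi(words, i, pred))
    return alpha

def constrained_predict(alpha, words, constraint):
    """Inference at evaluation time: the constraint is applied only when decoding."""
    candidates = (y for y in itertools.product(TAGS, repeat=len(words)) if constraint(y))
    return max(candidates, key=lambda y: sum(
        alpha[f] for i, t in enumerate(y) for f in local_phi(words, i, t)))

# Example constraint: every sentence must contain at least one verb.
alpha = train_local([(["the", "man", "saw", "the", "dog"], ["D", "N", "V", "D", "N"])])
print(constrained_predict(alpha, ["the", "man", "saw", "the", "dog"],
                          lambda y: "V" in y))
```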
ILP-CRF Introduction [Roth and Yih 2005]
• ILP-CRF model for Semantic Role Labeling as a
sequence labeling problem
• Viterbi inference for CRFs can include constraints
• Cannot handle long-range or general constraints
• Viterbi is a shortest path problem that can be solved with ILP
• Use integer linear programming to express general
constraints during inference
• Allows incorporation of expressive constraints, including long-range
constraints between distant tokens that cannot be handled by
Viterbi
[Figure: Viterbi decoding as a shortest path problem, from source node s through a lattice of candidate states (A, B, C) at each position to sink node t]
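As a sketch of this shortest-path view (notation mine, not necessarily the exact formulation in the paper): Viterbi decoding corresponds to the integer linear program below, with one binary variable per lattice edge, and any additional linear constraints can simply be appended.

```latex
% x_{u,v} = 1 iff the path uses edge (u,v) of the lattice (including edges out of s
% and into t); c_{u,v} is the negated transition/emission score, so the shortest
% path is the Viterbi path.
\begin{align*}
\min_{x}\quad & \sum_{(u,v) \in E} c_{u,v}\, x_{u,v} \\
\text{s.t.}\quad
  & \sum_{v :\, (s,v) \in E} x_{s,v} = 1
      && \text{one unit of flow leaves } s \\
  & \sum_{u :\, (u,w) \in E} x_{u,w} = \sum_{v :\, (w,v) \in E} x_{w,v}
      && \text{flow conservation at every interior node } w \\
  & x_{u,v} \in \{0, 1\}
      && \text{integrality} \\
  & \text{plus any additional linear constraints, e.g.\ long-range label interactions}
\end{align*}
```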
ILP-CRF Models
• CRF trained with max log-likelihood
• CRF trained with voted perceptron
• L+I
• IBT
• Local training (L+I)
• Perceptron, winnow, voted perceptron, voted winnow
ILP-CRF Results
[Figure: results comparing sequential models, locally trained L+I models, and IBT models]
ILP-CRF Conclusions
• Local learning models perform poorly on their own, but their
performance improves dramatically when constraints are added at
evaluation
• Performance is comparable to IBT methods
• The best models for global and local training show
comparable results
• L+I vs. IBT: L+I requires fewer training examples, is more
efficient, and outperforms IBT in most situations (unless the local
problems are difficult to solve) [Punyakanok et al., IJCAI
2005]
Variations: Voted Perceptron
• For iteration t = 1, …, T:
  • For example i = 1, …, n:
    • Given parameter $\alpha^{t,i}$, get the label sequence for the example by Viterbi decoding:
      $\mathrm{best\_tags}_i = \arg\max_{\mathrm{tags}_i} \; \alpha^{t,i} \cdot \Phi(\mathrm{words}_i, \mathrm{tags}_i)$
• Each example defines a tagging sequence.
• The voted perceptron takes the most frequently occurring output in the set $\{\mathrm{best\_tags}_1, \ldots, \mathrm{best\_tags}_n\}$
Variations: Voted Perceptron
• Averaged algorithm (Collins 2002): an approximation of the
voted method. It uses the averaged parameter $\gamma$ instead of the
final parameter $\alpha^{T,n}$ (sketched below):
  $\gamma = \sum_{t=1,\ldots,T,\; i=1,\ldots,n} \alpha^{t,i} \,/\, (nT)$
• Performance:
  • Higher F-measure, lower error rate
  • Greater stability (less variance) in its scores
• Variation: modified averaging algorithm for the latent
perceptron
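A minimal sketch of the averaging step, assuming a `decode(alpha, x)` function that returns the highest-scoring label sequence and a `phi(x, y)` global feature function like the ones sketched earlier; it accumulates $\alpha^{t,i}$ after every example and returns $\gamma$. Real implementations avoid the per-example full-vector accumulation with a lazy-update trick.

```python
from collections import Counter

def averaged_perceptron(examples, decode, phi, max_iter=10):
    """Averaged perceptron (sketch): keep a running sum of the parameter vector
    after every example and return the average, rather than the final alpha."""
    alpha = Counter()
    total = Counter()                       # sum of alpha^{t,i} over all t, i
    for _ in range(max_iter):
        for x, y in examples:
            y_star = decode(alpha, x)
            if y_star != y:
                alpha.update(phi(x, y))
                alpha.subtract(phi(x, y_star))
            total.update(alpha)             # accumulate alpha^{t,i}
    n_updates = max_iter * len(examples)    # n * T
    return Counter({f: v / n_updates for f, v in total.items()})   # gamma
```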
Variations: Latent Structure Perceptron
• Model definition (a brute-force decoding sketch follows this list):
  $y' = \arg\max_{y \in Y} \left( \max_{h \in H} \; \alpha \cdot \Phi(x, h, y) \right)$
• $\alpha$ is the perceptron parameter vector; $\Phi(\cdot)$ is the feature
encoding function mapping to a feature vector
• In the NER task, x is the word sequence, y is the named-entity type
sequence, and h is the hidden latent variable sequence
• Features: word unigrams and bigrams, POS, and orthography
(prefix, upper/lower case)
• Why latent variables?
  • To capture latent dependencies (i.e., hidden sub-structure)
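A brute-force sketch of this decision rule, assuming small enumerable label and latent spaces (`label_space`, `latent_space`, and `phi` are placeholder functions): maximize over both y and h, then return only y'.

```python
def predict(alpha, x, label_space, latent_space, phi):
    """Latent-variable prediction: maximize over both y and the hidden h,
    then return only the label sequence y' (h is handled by the inner max)."""
    best_y, best_score = None, float("-inf")
    for y in label_space(x):
        for h in latent_space(x):
            score = sum(alpha.get(f, 0.0) * v for f, v in phi(x, h, y).items())
            if score > best_score:
                best_y, best_score = y, score
    return best_y
```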
Variations: Latent Structure Perceptron
• Purely latent structure perceptron (Connor et al.)
• Training: structured perceptron with margin
  • C: margin
  • alpha: learning rate
• Variation: modified parameter-averaging method (Sun et al.): re-initialize
the parameter vector with the averaged parameter every k iterations
  • Advantage: reduces overfitting of the latent perceptron
Variations: Latent Structure Perceptron
• Disadvantage of the purely latent perceptron:
h* is found and then forgotten for each x
• Solution: Online Latent Classifier (Connor et al.)
• Two classifiers:
  • Latent classifier, with parameter vector u
  • Label classifier, with parameter vector w
  $(y^*, h^*) = \arg\max_{y \in Y,\, h \in H} \left( w \cdot \Phi(x, h, y) + u \cdot \Phi_u(x, h) \right)$
Variations: Latent Structure Perceptron
• Online Latent Classifier training (Connor et al.)
Variations: Latent Structure Perceptron
• Experiments: Bio-NER with purely latent perceptron
[Table: Bio-NER results reporting training time and F-measure for different settings; cc = cut-off, Odr = order of (high-order) dependency]
Variations: Latent Structure Perceptron
• Experiments: Semantic Role Labeling with argument/predicate
structure as the latent structure
  • X (sentence): She  likes  yellow  flowers
  • Y (roles): agent  predicate  -----  patient
  • H (latent structure): exactly one predicate; at least one argument
• Optimization for (h*, y*): search over all possible
argument/predicate structures (see the sketch below). For more complex data,
other methods are needed.
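A toy sketch of that exhaustive search (the enumeration scheme is mine): it lists every candidate latent structure with exactly one predicate position and at least one argument position; the best (h*, y*) would then be chosen by scoring each candidate.

```python
from itertools import combinations

def candidate_structures(n_words):
    """Enumerate latent predicate/argument structures for an n-word sentence:
    exactly one predicate position, at least one argument among the rest."""
    for pred in range(n_words):
        others = [i for i in range(n_words) if i != pred]
        for k in range(1, len(others) + 1):
            for args in combinations(others, k):
                yield pred, args

# For "She likes yellow flowers" (4 words): every (predicate, argument-set) pair.
for structure in candidate_structures(4):
    print(structure)
```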
[Figure: results on the test set]
Summary
• Structured Perceptron definition and motivation
• IBT vs. L+I
• Variations of the Structured Perceptron
References:
• M. Collins. Discriminative Training for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002.
• X. Sun, T. Matsuzaki, D. Okanohara, and J. Tsujii. Latent Variable Perceptron Algorithm for Structured Classification. IJCAI 2009.
• D. Roth and W. Yih. Integer Linear Programming Inference for Conditional Random Fields. ICML 2005.
• M. Connor, C. Fisher, and D. Roth. Online Latent Structure Training for Language Acquisition. IJCAI 2011.