Re-ranking for NP-Chunking:
Maximum-Entropy Framework
By: Mona Vajihollahi
Agenda

- Background
- Training approach
- Reranking
- Results
- Conclusion
- Future Directions
- Comparison: VP, MaxEnt and Baseline
- Application

12/4/2002
Background

- The MRF framework was previously used for reranking in natural language parsing
- MRF can be viewed in terms of the principle of maximum entropy
- It was found to be "too inefficient to run on the full data set"
- The experiment was not completed; no final results on its performance are provided
Training Approach (1)

- Goal: learn a ranking function F(x_{i,j}):

  F(x_{i,j}) = w_0 L(x_{i,j}) + \sum_{k=1}^{m} w_k h_k(x_{i,j})

- x_{i,j}: the j'th chunking candidate for the i'th sentence
- L(x_{i,j}): the log-probability that the base chunking model assigns to x_{i,j}
- h_k(x_{i,j}): an indicator function specifying the presence of feature f_k in x_{i,j}
- w_k: a parameter giving the weight of feature f_k
- x_{i,1}: the candidate with the highest golden score
- We need to find the parameters of the model, the w_k's, such that it achieves good scores on test data
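The ranking function above can be sketched in a few lines; this is a minimal illustration, assuming candidates are represented by their base-model log-probability plus the set of feature indices that fire on them (all names here are hypothetical, not from the original work):

```python
# Sketch of F(x_ij) = w0 * L(x_ij) + sum_k wk * hk(x_ij).
# A candidate is represented by its base-model log-probability and the
# indices of the features present in it (h_k = 1 for those indices).

def rank_score(log_prob, active_features, w0, weights):
    """Score one chunking candidate.

    log_prob        -- L(x_ij), log-probability from the base chunker
    active_features -- indices k where h_k(x_ij) = 1
    w0              -- weight of the base-model log-probability
    weights         -- dict mapping feature index k to weight w_k
    """
    return w0 * log_prob + sum(weights.get(k, 0.0) for k in active_features)

# Toy usage: two candidates for one sentence.
weights = {0: 0.5, 1: -0.2}
c1 = rank_score(-1.2, [0], w0=1.0, weights=weights)  # -1.2 + 0.5 = -0.7
c2 = rank_score(-1.0, [1], w0=1.0, weights=weights)  # -1.0 - 0.2 = -1.2
best = max([c1, c2])
```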
Training Approach (2)

- How do we find a good parameter setting?
- Try to minimize the number of ranking errors F makes on the training data
  - Ranking error: a candidate with a lower golden score is ranked above the best candidate
- Maximize the likelihood of the golden candidates
- Log-linear model: the probability of x_{i,q} being the correct chunking for the i'th sentence is defined as:

  P(x_{i,q}) = \frac{e^{F(x_{i,q})}}{\sum_{j=1}^{n_i} e^{F(x_{i,j})}}

- Use the maximum-entropy framework to estimate the probability distribution
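The log-linear model above is a softmax over the candidate scores of one sentence. A minimal sketch (the function name is ours, not from the original slides):

```python
import math

def candidate_probs(scores):
    """P(x_iq) = exp(F(x_iq)) / sum_j exp(F(x_ij)) over one sentence's candidates.

    Subtracting the max score before exponentiating keeps exp() from
    overflowing without changing the resulting distribution.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy usage: three candidates with scores F = 2.0, 1.0, 0.0.
probs = candidate_probs([2.0, 1.0, 0.0])
# probs sum to 1 and are ordered like the scores
```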
Training Approach (3)

- First approach: feature selection
  - Goal: find a small subset of features that contribute most to maximizing the likelihood of the training data
  - Greedily pick the feature, with additive weight δ, that has the most impact on maximizing the likelihood
  - The complexity is O(TNFC), where
    - T: number of iterations (number of selected features)
    - N: number of sentences in the training set
    - F: number of features
    - C: number of iterations needed for the weight of each feature to converge
  - Finding the feature/weight pair with the highest gain is too expensive
Training Approach (4)

- Second approach: forget about gain, just use GIS
  1. Set w_0 = 1 and w_1 ... w_m = 0
  2. For each feature f_k, observed[k] is the number of times feature f_k is seen in the best chunkings: \sum_i h_k(x_{i,1})
  3. For each feature f_k, expected[k] is the expected number of times feature f_k is seen under the model: \sum_i \sum_{j=1}^{n_i} h_k(x_{i,j}) P(x_{i,j})
  4. For each feature f_k: w_k = w_k + \log(observed[k] / expected[k])
  5. Repeat steps 2-4 until convergence
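The GIS loop above can be sketched as follows; this is a simplified illustration under our own data representation (each sentence is a list of candidates, the first being the golden-best one), with hypothetical names throughout:

```python
import math

def gis_train(sentences, num_features, rounds=100):
    """Sketch of the GIS-style update from steps 1-5 above.

    sentences -- list of sentences; each is a list of candidates, where a
                 candidate is (log_prob, active_feature_indices) and the
                 first candidate is the golden-best one (x_{i,1}).
    """
    w0, w = 1.0, [0.0] * num_features                    # step 1

    def score(cand):
        lp, feats = cand
        return w0 * lp + sum(w[k] for k in feats)

    for _ in range(rounds):
        observed = [0.0] * num_features                  # counts in best chunkings
        expected = [0.0] * num_features                  # counts under the model
        for cands in sentences:
            for k in cands[0][1]:                        # step 2: best candidate
                observed[k] += 1.0
            scores = [score(c) for c in cands]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            for (_lp, feats), e in zip(cands, exps):     # step 3: model expectation
                for k in feats:
                    expected[k] += e / z
        for k in range(num_features):                    # step 4: multiplicative update
            if observed[k] > 0 and expected[k] > 0:
                w[k] += math.log(observed[k] / expected[k])
    return w0, w

# Toy usage: one sentence, two candidates; feature 0 fires on the golden one.
sents = [[(-1.0, [0]), (-1.0, [1])]]
w0, w = gis_train(sents, num_features=2, rounds=5)
# the weight of the feature seen in the best chunking grows above the other
```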
Training Approach (5)

- Instead of updating just one weight in each pass over the training data, all the weights are updated
- The procedure can be repeated for a fixed number of iterations, or until no significant change in the log-likelihood occurs
- Experiments showed that convergence is achieved after about 100 rounds
- The first method might lead to better performance, but it was too inefficient to apply!
Reranking

- The output of the training phase is a weight vector
- For each sentence in the test set, the function F(x_{i,j}) specifies the score of each of its candidates:

  Score(x_{i,j}) = F(x_{i,j}) = w_0 L(x_{i,j}) + \sum_{k=1}^{m} w_k h_k(x_{i,j})

- The candidate with the highest score is the best one
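At test time, reranking reduces to an argmax over each sentence's candidate scores; a minimal sketch, with a hypothetical scoring callback standing in for the trained F:

```python
def rerank(candidates, score_fn):
    """Return the index of the highest-scoring chunking candidate."""
    scores = [score_fn(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy usage: candidates represented directly by precomputed scores.
best_idx = rerank([0.3, 1.7, -0.5], score_fn=lambda s: s)
```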
Results (1)

- Initial experiment:
  - Cut-off: 10 (features with fewer than 10 counts were omitted)
- [Chart: precision and recall (98-100%) vs. training rounds (10-90); precision drops as training proceeds. Training is making it WORSE?!]
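The cut-off mentioned above is a simple frequency threshold on the feature set; a minimal sketch (names are ours):

```python
from collections import Counter

def apply_cutoff(feature_counts, cutoff):
    """Keep only features seen at least `cutoff` times in the training data."""
    return {f for f, c in feature_counts.items() if c >= cutoff}

# Toy usage: with cut-off 10, the rare feature "f_a" is dropped.
counts = Counter({"f_a": 3, "f_b": 12, "f_c": 45})
kept = apply_cutoff(counts, cutoff=10)
```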
Results (2)

- Try other cut-offs
- Convergence occurred by round 100
- [Chart: precision vs. rounds (10-100) for cut-offs 10 through 50; cut-off 50 is worse than 45]
Results (3)

- [Chart: recall vs. rounds (10-100) for cut-offs 10 through 50]
Results (4)

- Why does cut-off 45 perform better than cut-off 10?
  - The feature set is extracted from the training data set
  - Features with low counts are probably the dataset-specific ones
  - As training proceeds, rare features become more important!
- Label-bias problem: this problem happens when a decision is made locally, regardless of global history
- [Chart: precision and recall vs. rounds (0-100) for cut-offs 45 and 10]
Results (5)

- The training process is supposed to increase the likelihood of the training data
- Recall is always increasing, but precision is not!
- Overfitting!
- [Chart: precision and recall vs. rounds (0-100) for cut-offs 45 and 10. Why does the precision decrease?]
Conclusion

- Considering the trade-off between precision and recall, cut-off 45 has the best performance
- [Charts: precision and recall vs. rounds (10-100) for cut-offs 10 through 50]

Cut-Off   Precision   Recall   Num. of Rounds
10        98.51       99.91    50
20        98.76       99.89    40
30        98.91       99.88    60
40        99.15       99.89    80
45        99.25       99.87    50
50        99.20       99.83    40
Future Directions

- Expand the template set
  - Find more useful feature templates
- Try to solve the label-bias problem
  - Apply a smoothing method (like a discount factor, or a Gaussian prior)
Comparison: VP, MaxEnt, Baseline

- Both re-ranking methods perform better than the baseline
- MaxEnt
  - is more complex
  - should solve the label-bias problem
- Voted Perceptron
  - is a simple algorithm
  - achieves better results

            Precision   Recall
VP          99.65%      99.98%
MaxEnt      99.25%      99.87%
Baseline    97.71%      99.32%
Max.        99.95%      100.0%
Applications

- Both methods can be applied to any probabilistic baseline chunker (e.g. an HMM chunker)
- The only restriction: the baseline has to produce the n-best candidates for each sentence
- The same framework can be used for VP-chunking
  - The same feature templates are used to extract features for VP-chunking
- Higher accuracy in text chunking leads to higher accuracy in related tasks, like larger-scale grouping and subunit extraction