Re-ranking for NP-Chunking: Maximum-Entropy Framework
By: Mona Vajihollahi
12/4/2002

Agenda
- Background
- Training approach
- Reranking
- Results
- Conclusion
- Future directions
- Comparison: VP, MaxEnt, and baseline
- Applications
Background
- The MRF framework was previously used in reranking for natural-language parsing.
- An MRF can be viewed in terms of the principle of maximum entropy.
- It was found to be "too inefficient to run on the full data set".
- The experiment was not completed, and no final results on its performance were provided.

Training Approach (1)
- Goal: learn a ranking function F(x_{i,j}):

  F(x_{i,j}) = w_0 · L(x_{i,j}) + Σ_{k=1}^{m} w_k · h_k(x_{i,j})

  where
  - x_{i,j}: the j'th chunking candidate for the i'th sentence
  - L(x_{i,j}): the log-probability that the base chunking model assigns to x_{i,j}
  - h_k(x_{i,j}): an indicator function specifying whether feature f_k occurs in x_{i,j}
  - w_k: a parameter giving the weight of feature f_k
  - x_{i,1}: the candidate with the highest golden score
- We need to find the parameters of the model, the w_k's, such that it achieves good scores on test data.

Training Approach (2)
- How do we find a good parameter setting?
- Minimize the number of ranking errors F makes on the training data.
  - Ranking error: a candidate with a lower golden score is ranked above the best candidate.
- Maximize the likelihood of the golden candidates.
- Log-linear model: the probability that x_{i,q} is the correct chunking for the i'th sentence is defined as

  P(x_{i,q}) = e^{F(x_{i,q})} / Σ_{j=1}^{n_i} e^{F(x_{i,j})}

- Use the maximum-entropy framework to estimate the probability distribution.

Training Approach (3)
- First approach: feature selection.
- Goal: find a small subset of features that contributes most to maximizing the likelihood of the training data.
- Greedily pick the feature, together with an additive weight update δ, that has the most impact on the likelihood.
- The complexity is O(T·N·F·C), where
  - T: the number of iterations (the number of selected features)
  - N: the number of sentences in the training set
  - F: the number of features
  - C: the number of iterations needed for the weight of each feature to converge
- Finding the feature/weight pair with the highest gain is too expensive.

Training Approach (4)
- Second approach: forget about gain, just use GIS (Generalized Iterative Scaling).
  1. Set w_0 = 1 and w_1 = … = w_m = 0.
  2. For each feature f_k, observed[k] is the number of times the feature is seen in the best chunkings: observed[k] = Σ_i h_k(x_{i,1}).
  3. For each feature f_k, expected[k] is the expected number of times the feature is seen under the model: expected[k] = Σ_i Σ_{j=1}^{n_i} h_k(x_{i,j}) · P(x_{i,j}).
  4. For each feature f_k: w_k = w_k + log(observed[k] / expected[k]).
  5. Repeat steps 2-4 until convergence.

Training Approach (5)
- Instead of updating just one weight in each pass over the training data, all the weights are updated.
- The procedure can be repeated for a fixed number of iterations, or until no significant change in the log-likelihood occurs.
- Experiments showed that convergence is achieved after about 100 rounds.
- The first method might lead to better performance, but it was too inefficient to apply!

Reranking
- The output of the training phase is a weight vector.
- For each sentence in the test set, the function F(x_{i,j}) specifies the score for each of its candidates:

  Score(x_{i,j}) = F(x_{i,j}) = w_0 · L(x_{i,j}) + Σ_{k=1}^{m} w_k · h_k(x_{i,j})

- The candidate with the highest score is the best one.
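To make the training procedure concrete, here is a minimal Python sketch of the GIS loop above. All names (f_score, candidate_probs, gis_train) and the candidate representation are illustrative assumptions, not the original implementation: a candidate is taken to be a pair of its base-model log-probability L(x_{i,j}) and the set of indices of its active features.

```python
import math

# Hypothetical representation: a candidate is (logp, feats), where logp is
# L(x_{i,j}) and feats is the set of feature indices k with h_k(x_{i,j}) = 1.
# sentences[i] is the candidate list for sentence i, ordered so that
# sentences[i][0] is x_{i,1}, the candidate with the highest golden score.

def f_score(cand, w0, w):
    """F(x_{i,j}) = w0 * L(x_{i,j}) + sum_k w_k * h_k(x_{i,j})."""
    logp, feats = cand
    return w0 * logp + sum(w[k] for k in feats)

def candidate_probs(cands, w0, w):
    """Log-linear model: P(x_{i,j}) = exp(F(x_{i,j})) / sum_j' exp(F(x_{i,j'}))."""
    scores = [f_score(c, w0, w) for c in cands]
    top = max(scores)                     # shift by the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gis_train(sentences, num_features, rounds=100):
    """Update all feature weights in each pass, as in the second approach."""
    w0, w = 1.0, [0.0] * num_features     # step 1: w0 = 1, w1..wm = 0
    for _ in range(rounds):               # step 5: repeat for a fixed number of rounds
        observed = [0.0] * num_features   # step 2: feature counts in the golden candidates
        expected = [0.0] * num_features   # step 3: expected feature counts under the model
        for cands in sentences:
            for k in cands[0][1]:
                observed[k] += 1.0
            for cand, p in zip(cands, candidate_probs(cands, w0, w)):
                for k in cand[1]:
                    expected[k] += p
        for k in range(num_features):     # step 4: w_k += log(observed[k] / expected[k]);
            if observed[k] > 0.0 and expected[k] > 0.0:   # skip to avoid log(0)
                w[k] += math.log(observed[k] / expected[k])
    return w0, w
```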
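Given the trained weights, the reranking step itself reduces to an argmax over F. A standalone sketch under the same assumed candidate representation:

```python
def rerank(cands, w0, w):
    """Pick the candidate x_{i,j} with the highest score
    F(x_{i,j}) = w0 * L(x_{i,j}) + sum_k w_k * h_k(x_{i,j})."""
    return max(cands, key=lambda c: w0 * c[0] + sum(w[k] for k in c[1]))

# Hypothetical usage with two candidates (base log-prob, active features):
# cands = [(-2.1, {0, 3}), (-2.4, {1})]
# w0, w = gis_train([cands], num_features=4, rounds=10)
# best = rerank(cands, w0, w)
```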
Results (1)
- Initial experiment with cut-off 10 (features with fewer than 10 counts were omitted).
- Training is making it WORSE?!
[Figure: precision and recall vs. training rounds (10-90), cut-off 10]

Results (2)
- Try other cut-offs.
- Convergence had occurred by round 100.
- Cut-off 50 is worse than 45.
[Figure: precision vs. training rounds (10-100) for cut-offs 10, 20, 30, 35, 40, 45, and 50]

Results (3)
[Figure: recall vs. training rounds (10-100) for cut-offs 10, 20, 30, 35, 40, 45, and 50]

Results (4)
- Why does cut-off 45 perform better than 10?
- The feature set is extracted from the training data set, so features with low counts are probably the dataset-specific ones.
- As training proceeds, rare features become more important!
- Label-bias problem: arises when a decision is made locally, regardless of the global history.
[Figure: precision and recall vs. training rounds, cut-off 45 vs. cut-off 10]

Results (5)
- The training process is supposed to increase the likelihood of the training data.
- Recall is always increasing, but precision is not. Why does precision decrease? Overfitting!
[Figure: precision and recall vs. training rounds, cut-off 45 vs. cut-off 10]

Conclusion
- Considering the trade-off between precision and recall, cut-off 45 has the best performance.

  Cut-Off   Precision (%)   Recall (%)   Num. of Rounds
  10        98.51           99.91        50
  20        98.76           99.89        40
  30        98.91           99.88        60
  40        99.15           99.89        80
  45        99.25           99.87        50
  50        99.20           99.83        40

Future Directions
- Expand the template set: find more useful feature templates.
- Try to solve the label-bias problem.
- Apply a smoothing method, such as a discount factor or a Gaussian prior (a sketch appears at the end of this transcript).

Comparison: VP, MaxEnt, Baseline
- Both re-ranking methods perform better than the baseline.
- MaxEnt is more complex and still has to solve the label-bias problem.
- The voted perceptron is a simple algorithm and achieves better results.

  Method      Precision   Recall
  VP          99.65%      99.98%
  MaxEnt      99.25%      99.87%
  Baseline    97.71%      99.32%
  Max.        99.95%      100.0%

Applications
- Both methods can be applied to any probabilistic baseline chunker (e.g., an HMM chunker).
- The only restriction: the baseline has to produce the n-best candidates for each sentence.
- The same framework can be used for VP-chunking; the same feature templates are used to extract features for VP-chunking.
- Higher accuracy in text chunking leads to higher accuracy in related tasks, like larger-scale grouping and subunit extraction.
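On the Gaussian-prior smoothing listed under Future Directions: exact GIS with a prior requires a modified update, so one common alternative (an assumption here, not something the talk used) is gradient ascent on the prior-penalized log-likelihood, where the gradient for each weight is observed[k] - expected[k] - w_k/σ². A minimal sketch, reusing the observed/expected counts from the training sketch above:

```python
def prior_penalized_step(w, observed, expected, sigma2=1.0, lr=0.1):
    """One gradient-ascent step on the log-likelihood plus a Gaussian prior.

    For the log-linear model above, d(logL)/dw_k = observed[k] - expected[k];
    the prior N(0, sigma2) adds -w_k / sigma2, shrinking the weights of rare,
    dataset-specific features toward zero instead of letting them grow without bound.
    """
    for k in range(len(w)):
        w[k] += lr * (observed[k] - expected[k] - w[k] / sigma2)
    return w
```

This ties back to the Results (4) observation: rare features gaining importance as training proceeds is exactly what the prior term is meant to counteract.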