Transcript Bleu.ppt

Re-evaluating Bleu
Alison Alvarez
Machine Translation Seminar
February 16, 2006
Overview
• The Weaknesses of Bleu
 Introduction
 Precision and Recall
 Fluency and Adequacy
 Variations Allowed by Bleu
 Bleu and Tides 2005
• An Improved Model
 Overview of the Model
 Experiment
 Results
• Conclusions
Introduction
• Bleu has been shown to have high
correlations with human judgments
• Bleu has been used by MT researchers for
five years, sometimes in place of manual
human evaluations
• But does the minimization of the error rate
accurately show improvements in
translation quality?
Precision and Bleu
• Of my answers, how many are right/wrong?
• Precision = |B ∩ C| / |C| = A / C
[Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap B ∩ C]
Precision and Bleu
Bleu is a precision-based metric
• The modified precision score, p_n:
 p_n = ∑_S ∑_{ngram ∈ S} Count_matched(ngram) / ∑_S ∑_{ngram ∈ S} Count(ngram)
 where the outer sums run over the sentences S of the candidate corpus
Recall and Bleu
• Of the potential answers, how many did I retrieve/miss?
• Recall = |B ∩ C| / |B| = A / B
[Venn diagram: B = reference translation, C = hypothesis translation, A = their overlap B ∩ C]
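As a toy illustration of the two Venn-diagram definitions (my own example, treating each translation as a set of word types):

```python
reference = set("the gunman was shot dead by police".split())   # B
hypothesis = set("police shot the gunman yesterday".split())    # C
overlap = reference & hypothesis                                # A = B ∩ C

precision = len(overlap) / len(hypothesis)   # A / C: how many of my answers are right
recall = len(overlap) / len(reference)       # A / B: how many of the possible answers I found
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```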
Recall and Bleu
• Because Bleu uses multiple reference
translations at once, recall cannot be
calculated
Fluency and Adequacy to Evaluators
• Fluency
 “How do you judge the fluency of this translation?”
 Judged with no reference translation and to
the standard of written English
• Adequacy
 “How much of the meaning expressed in the
reference is also expressed in the hypothesis
translation?”
Variations
• Bleu allows variations in word and phrase order that reduce fluency
• No constraints are placed on the order of matching n-grams
Variations
[Slide figure omitted: two example translations illustrating reordered n-gram matches]
The two translations in the figure have the same bigram score.
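Since the original figure is not in the transcript, here is a toy example of my own making the same point: swapping whole matched chunks leaves the bigram matches unchanged, even though the second candidate is far less fluent.

```python
from collections import Counter

def clipped_matches(hypothesis, reference, n):
    """(matched, total) n-gram counts for a hypothesis, with BLEU-style clipping."""
    grams = lambda s: Counter(zip(*[s.split()[i:] for i in range(n)]))
    hyp, ref = grams(hypothesis), grams(reference)
    return sum(min(c, ref[g]) for g, c in hyp.items()), sum(hyp.values())

reference = "the gunman was shot dead by the police yesterday"
fluent    = "the gunman was shot today by the police"
scrambled = "by the police today the gunman was shot"   # same chunks, reordered

for candidate in (fluent, scrambled):
    print(candidate, "->", clipped_matches(candidate, reference, 2))
# Both candidates match 5 of their 7 bigrams against the reference.
```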
Bleu and Tides 2005
• Bleu scores showed significant divergence
from human judgments in the 2005 Tides
Evaluation
• It ranked the system considered the best
by humans as sixth in performance
Bleu and Tides 2005
• Reference: Iran had already announced Kharazi would
boycott the conference after Jordan’s King Abdullah II
accused Iran of meddling in Iraq’s affairs
• System A: Iran has already stated that Kharazi’s
statements to the conference because of the Jordanian
King Abdullah II in which he stood accused Iran of
interfering in Iraqi affairs.
• N-gram matches: 1-gram: 27; 2-gram: 20; 3-gram: 15; 4-gram: 10
• Human scores: Adequacy: 3, 2; Fluency: 3, 2
From Callison-Burch 2005
Bleu and Tides 2005
• Reference: Iran had already announced Kharazi would
boycott the conference after Jordan’s King Abdullah II
accused Iran of meddling in Iraq’s affairs
• System B: Iran already announced that Kharazi will not
attend the conference because of statements made by
Jordanian Monarch Abdullah II who has accused Iran of
interfering in Iraqi affairs.
• N-gram matches: 1-gram: 24; 2-gram: 19; 3-gram: 15; 4-gram: 12
• Human scores: Adequacy: 5, 4; Fluency: 5, 4
From Callison-Burch 2005
An Experiment with Bleu
Bleu and Tides 2005
• “This opens the possibility that in order for Bleu to be valid only sufficiently similar systems should be compared with one another”
Additional Flaws
• Multiple human reference translations are expensive
• N-grams showing up in multiple reference translations are weighted the same
• Content words are weighted the same as common words
 ‘The’ counts the same as ‘Parliament’
• Bleu accounts for the diversity of human translations, but not for synonyms
An Extension of Bleu
• Described in Babych & Hartley, 2004
• Adds weights to matched items using
 tf/idf
 S-score
Addressing Flaws
• Can work with only one human translation
 Can actually calculate recall
 The paper is not very clear about how this sentence is selected
• Content words are weighted differently from common words
 ‘The’ does not count the same as ‘Parliament’
Calculating the tf/idf Score
• tf.idf(i,j) = (1 + log(tf(i,j))) × log(N / df(i)), if tf(i,j) ≥ 1, where:
 tf(i,j) is the number of occurrences of the word w(i) in the document d(j)
 df(i) is the number of documents in the corpus where the word w(i) occurs
 N is the total number of documents in the corpus
From Babych 2004
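A direct transcription of the formula into Python (my own sketch; the slide does not specify the logarithm base, so the natural log is assumed here):

```python
import math

def tf_idf(tf_ij, df_i, n_docs):
    """tf.idf(i,j) = (1 + log(tf_ij)) * log(N / df_i), defined for tf_ij >= 1.

    tf_ij:  occurrences of word w_i in document d_j
    df_i:   number of corpus documents containing w_i
    n_docs: total number of documents in the corpus (N)
    """
    if tf_ij < 1:
        return 0.0
    return (1.0 + math.log(tf_ij)) * math.log(n_docs / df_i)

# e.g. a word seen 3 times in this document and in 5 of 100 corpus documents:
print(tf_idf(tf_ij=3, df_i=5, n_docs=100))
```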
Calculating the S-Score
• The S-score was calculated as:
 S(i,j) = log( [ (Pdoc(i,j) − Pcorp-doc(i)) × (N − df(i)) / N ] / Pcorp(i) )
where:
 Pdoc(i,j) is the relative frequency of the word in the text
 Pcorp-doc(i) is the relative frequency of the same word in the rest of the corpus, without this text
 (N − df(i)) / N is the proportion of texts in the corpus where this word does not occur
 Pcorp(i) is the relative frequency of the word in the whole corpus, including this particular text
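Likewise, a literal Python rendering of the S-score (my own sketch; it does not guard against the degenerate case where the argument of the log is non-positive):

```python
import math

def s_score(p_doc, p_corp_minus_doc, p_corp, df_i, n_docs):
    """S(i,j) = log( [(Pdoc(i,j) - Pcorp-doc(i)) * (N - df_i) / N] / Pcorp(i) ).

    p_doc:            relative frequency of the word in this text
    p_corp_minus_doc: relative frequency in the rest of the corpus
    p_corp:           relative frequency in the whole corpus
    df_i:             number of texts in which the word occurs
    n_docs:           total number of texts in the corpus (N)
    """
    numerator = (p_doc - p_corp_minus_doc) * (n_docs - df_i) / n_docs
    return math.log(numerator / p_corp)

# e.g. a word that is markedly more frequent in this text than in the corpus:
print(s_score(p_doc=0.02, p_corp_minus_doc=0.001, p_corp=0.002, df_i=10, n_docs=100))
```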
Integrating the S-Score
• If for a lexical item in a text the S-score > 1, all counts for the N-grams containing this item are increased by the S-score (not just by 1, as in the baseline BLEU approach).
• If the S-score ≤ 1, the usual N-gram count is applied: the count is increased by 1.
From Babych 2004
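A rough sketch of how this weighting rule could be applied when accumulating matched n-gram counts. The data layout and the use of the maximum S-score within an n-gram are my assumptions; the slide only says that n-grams containing a salient item get the boosted count.

```python
def weighted_match_count(matched_ngrams, s_scores):
    """Sum match counts, boosting n-grams that contain a salient word.

    matched_ngrams: iterable of matched n-grams (tuples of words)
    s_scores:       dict mapping a word to its S-score (assumed precomputed
                    on the reference translation)
    """
    total = 0.0
    for gram in matched_ngrams:
        # Assumption: take the highest S-score of any word in the n-gram.
        salience = max(s_scores.get(word, 0.0) for word in gram)
        # S-score > 1: add the S-score itself; otherwise the usual count of 1.
        total += salience if salience > 1.0 else 1.0
    return total

# e.g. 'parliament' is salient (S = 2.3), 'the' is not:
print(weighted_match_count([("the", "parliament"), ("of", "the")],
                           {"parliament": 2.3, "the": 0.4}))
```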
The Experiment
• Used 100 French-English texts from the
DARPA-94 evaluation corpus
• Included two reference translations
• Results from 4 Different MT systems
The Experiment
• Stage 1:
 tf/idf and S-scores are calculated on the two reference translations
• Stage 2:
 N-gram based evaluation using precision and recall of n-grams in the MT output
 N-gram matches are weighted by tf/idf or by the S-score (see the sketch after this list)
• Stage 3:
 Comparison with human scores
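A sketch of how Stage 2 might combine the weighted counts into precision, recall, and an F-score; this is my reconstruction of the pipeline described above, not the authors' code:

```python
def weighted_prf(matched_in_hyp, total_in_hyp, matched_in_ref, total_in_ref):
    """Weighted precision, recall and F-score from accumulated n-gram weights.

    matched_in_hyp / total_in_hyp: matched vs. all (weighted) n-grams in the MT output
    matched_in_ref / total_in_ref: matched vs. all (weighted) n-grams in the reference
    """
    precision = matched_in_hyp / total_in_hyp
    recall = matched_in_ref / total_in_ref
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# e.g. weighted counts accumulated over a test corpus:
print(weighted_prf(410.0, 900.0, 410.0, 1200.0))
```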
Results for tf/idf
System       [ade]/[flu]     BLEU [1&2]   Prec. (w) 1/2     Recall (w) 1/2    F-score (w) 1/2
CANDIDE      0.677 / 0.455   0.3561       0.4767 / 0.4709   0.3363 / 0.3324   0.3944 / 0.3897
GLOBALINK    0.710 / 0.381   0.3199       0.4289 / 0.4277   0.3146 / 0.3144   0.3630 / 0.3624
MS           0.718 / 0.382   0.3003       0.4217 / 0.4218   0.3332 / 0.3354   0.3723 / 0.3737
REVERSO      NA / NA         0.3823       0.4760 / 0.4756   0.3643 / 0.3653   0.4127 / 0.4132
SYSTRAN      0.789 / 0.508   0.4002       0.4864 / 0.4813   0.3759 / 0.3734   0.4241 / 0.4206
Corr r(2) with [ade] – MT    0.5918       0.3399 / 0.3602   0.7966 / 0.8306   0.6479 / 0.6935
Corr r(2) with [flu] – MT    0.9807       0.9665 / 0.9721   0.8980 / 0.8505   0.9853 / 0.9699
Results for S-Score
System       [ade]/[flu]     BLEU [1&2]   Prec. (w) 1/2     Recall (w) 1/2    F-score (w) 1/2
CANDIDE      0.677 / 0.455   0.3561       0.4570 / 0.4524   0.3281 / 0.3254   0.3820 / 0.3785
GLOBALINK    0.710 / 0.381   0.3199       0.4054 / 0.4036   0.3086 / 0.3086   0.3504 / 0.3497
MS           0.718 / 0.382   0.3003       0.3963 / 0.3969   0.3237 / 0.3259   0.3563 / 0.3579
REVERSO      NA / NA         0.3823       0.4547 / 0.4540   0.3563 / 0.3574   0.3996 / 0.4000
SYSTRAN      0.789 / 0.508   0.4002       0.4633 / 0.4585   0.3666 / 0.3644   0.4094 / 0.4061
Corr r(2) with [ade] – MT    0.5918       0.2945 / 0.2996   0.8046 / 0.8317   0.6184 / 0.6492
Corr r(2) with [flu] – MT    0.9807       0.9525 / 0.9555   0.9093 / 0.8722   0.9942 / 0.9860
Results
• The weighted n-gram metrics correlate better with adequacy than BLEU does
• The weighted F-score is the most strongly correlated with fluency
• Scores computed from a single reference translation are stable
Conclusions
• The Bleu model can be too coarse to differentiate between very different MT systems
• Adequacy is harder to predict than fluency
• Adding weights and using recall and F-scores can bring higher correlations with adequacy and fluency scores
References
• Chris Callison-Burch, Miles Osborne and Philipp Koehn. 2006. Re-evaluating the Role of Bleu in Machine Translation Research. To appear in EACL-06.
• Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, July 2002, pp. 311-318.
• Bogdan Babych and Anthony Hartley. 2004. Extending BLEU MT Evaluation Method with Frequency Weighting. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, July 2004.
• Dan Melamed, Ryan Green and Joseph P. Turian. 2003. Precision and Recall of Machine Translation. In Proceedings of the Human Language Technology Conference (HLT-NAACL), pages 61-63, Edmonton, Alberta, May 2003. http://citeseer.csail.mit.edu/melamed03precision.html
• Deborah Coughlin. 2003. Correlating Automated and Human Assessments of Machine Translation Quality. In Proceedings of MT Summit IX.
• LDC. 2005. Linguistic Data Annotation Specification: Assessment of Fluency and Adequacy in Translations. Revision 1.5.
Precision and Bleu
• The Brevity Penalty is designed to compensate for overly terse translations
 BP = 1            if c > r
 BP = e^(1 − r/c)   if c ≤ r
 c = length of the corpus of hypothesis translations
 r = effective reference corpus length*
Precision and Bleu
• Thus, the total Bleu score is:
 BLEU = BP × exp( ∑_{n=1}^{N} w_n · log p_n )
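Putting the brevity penalty and the combined score together, a minimal Python sketch assuming the usual uniform weights w_n = 1/N:

```python
import math

def bleu_score(p_ns, hyp_corpus_len, eff_ref_len):
    """BLEU = BP * exp( sum_n w_n * log p_n ) with uniform w_n.

    p_ns:           modified precisions [p_1, ..., p_N]
    hyp_corpus_len: c, length of the corpus of hypothesis translations
    eff_ref_len:    r, effective reference corpus length
    """
    c, r = hyp_corpus_len, eff_ref_len
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    w = 1.0 / len(p_ns)
    return bp * math.exp(sum(w * math.log(p) for p in p_ns))

# e.g. p_1..p_4 on a hypothesis corpus slightly shorter than the references:
print(bleu_score([0.6, 0.4, 0.3, 0.2], hyp_corpus_len=95, eff_ref_len=100))
```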
Flaws in the Use of Bleu
• Experiments with Bleu, but no manual
evaluation (Callison-Burch 2005)