Improving Named Entity Translation Combining Phonetic and Semantic Similarities

Fei Huang, Stephan Vogel, Alex Waibel
Language Technologies Institute, School of Computer Science, CMU
NAACL 2004

Introduction

In the 2001 Chinese-English (C-E) translation evaluation test data, 20% of the named entities (NEs) are not included in the 50K-entry LDC C-E translation lexicon.

Most previous studies focused only on phonetic information. However, some NEs are not translated by their phonetic values (e.g. "南懷仁, Ferdinand Verbiest"). This work combines phonetic similarity (transliteration) and semantic similarity (context) to cover these non-transliterated NEs.

Source language: Chinese. Target language: English.

Surface String Transliteration

Training data: LDC C-E dictionary.

Bootstrapped unsupervised learning: learn transliteration probabilities between pinyin and English letters.

Pre-processing: romanize each Chinese word into pinyin.

0th iteration: use edit distances to generate mappings between Chinese and English word pairs.

Use the 3,000 word translations with the minimum edit distance from the 0th iteration to estimate new transliteration probabilities.

Generate new translation mappings using the new transliteration probabilities, and repeat.

In each iteration, an additional 500 pairs with the minimum transliteration cost are added to the existing NE pair list to update the transliteration probabilities.

Repeat until adding more NE pairs no longer improves the extraction accuracy.
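The 0th iteration described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain Levenshtein distance with unit costs, and normalizing by string length is an assumption made here so that long and short names are comparable. The function names and the toy candidate pairs are hypothetical, and the re-estimation of transliteration probabilities in later iterations is not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def select_seed_pairs(candidates, n_seed=3000):
    """Rank (pinyin, english) pairs by normalized edit distance, keep the best.

    Length normalization is an assumption of this sketch, not a detail
    stated in the slides.
    """
    def cost(pair):
        pinyin, english = pair
        p = pinyin.replace(" ", "").lower()
        e = english.replace(" ", "").lower()
        return edit_distance(p, e) / max(len(p), len(e), 1)

    return sorted(candidates, key=cost)[:n_seed]


# Hypothetical toy candidates: two transliterated names and one that is not.
candidates = [
    ("nan huai ren", "Ferdinand Verbiest"),
    ("bu shi", "Bush"),
    ("lun dun", "London"),
]
seeds = select_seed_pairs(candidates, n_seed=2)
```

With these toy inputs the non-transliterated pair ranks last, which is the behavior the seed selection relies on: pairs that are not phonetic translations get a high edit cost and stay out of the training list.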

Contextual Semantic Similarity

Training data: a subset of the English Xinhua News corpus.

Context vector selection:
  POS: phi-square, weight of POS
  Distance: weight of location
  Weight vector

Semantic similarity between context vectors:

P(v_f | v_e) is computed with a modified IBM translation Model 2 [Brown et al. 1993]. The computation uses the length of the source vector, the length of the target vector, and p(e|f), the word translation probability estimated from a C-E aligned corpus with IBM Model 1.

P(v_e | v_f) is estimated in a similar way.
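As a rough illustration of scoring one context vector against another with a word translation table, the sketch below uses a simplified IBM Model 1 style product with uniform alignment. It is not the modified Model 2 from the slides, which also uses the context word weights. The table `t`, its key order, the `floor` smoothing value, and the function name are all assumptions made for this example.

```python
import math


def log_vector_translation_prob(v_f, v_e, t, floor=1e-6):
    """Log of a Model 1 style probability of v_f given v_e.

    v_f, v_e: lists of context words (Chinese and English).
    t: dict mapping (chinese_word, english_word) to a word translation
       probability; a hypothetical table for this sketch.
    floor: small probability for unseen word pairs (an assumption).
    """
    if not v_f or not v_e:
        raise ValueError("both context vectors must be non-empty")
    log_p = 0.0
    for f in v_f:
        # Uniform alignment: average the translation probability over v_e.
        total = sum(t.get((f, e), floor) for e in v_e)
        log_p += math.log(total / len(v_e))
    return log_p


# Hypothetical toy translation table and context vectors.
t = {("总统", "president"): 0.8, ("访问", "visit"): 0.6}
related = log_vector_translation_prob(["总统", "访问"], ["president", "visit"], t)
unrelated = log_vector_translation_prob(["总统", "访问"], ["weather", "report"], t)
```

The reverse direction can be scored the same way with a table estimated in the other direction.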

Cross-lingual Retrieval for NE translations

English NEs in the retrieved text are automatically tagged by IdentiFinder(TM) from BBN (Bikel et al., 1997).

An overall similarity score combines the phonetic and semantic similarities. The NE pairs with the highest overall similarity scores are considered translations.

Since an NE can be translated in several different ways, and typos occur at times, the hypothesis with the highest frequency among the top NE hypotheses with similar spellings is chosen as the translation.
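The frequency-based choice among similarly spelled hypotheses might look like the sketch below. The spelling comparison via `difflib.SequenceMatcher`, the `min_ratio` threshold, and the toy hypotheses and counts are assumptions made for illustration; the slides do not say how spelling similarity is measured.

```python
from difflib import SequenceMatcher


def pick_translation(hypotheses, freq, min_ratio=0.8):
    """Pick the most frequent spelling variant of the top hypothesis.

    hypotheses: candidate English NEs, ranked best first by overall score.
    freq: dict mapping each candidate to its corpus frequency.
    min_ratio: similarity threshold for "similar spelling" (an assumption).
    """
    top = hypotheses[0]
    similar = [
        h for h in hypotheses
        if SequenceMatcher(None, top.lower(), h.lower()).ratio() >= min_ratio
    ]
    # The top hypothesis is always in `similar`, so max() has input.
    return max(similar, key=lambda h: freq.get(h, 0))


# Hypothetical example: the top-ranked hypothesis is a rare misspelling.
hypotheses = ["Verbeist", "Verbiest", "Beijing"]
freq = {"Verbeist": 1, "Verbiest": 12, "Beijing": 500}
chosen = pick_translation(hypotheses, freq)
```

Restricting the vote to spelling variants of the top hypothesis keeps a frequent but unrelated name (here "Beijing") from winning on frequency alone.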

Sentence-based or document-based indexing?

Test data: Chinese newswire documents.
114 Chinese NEs are selected and translated manually.
Indexed corpus: 963,478 English documents from the Xinhua News Agency.
Retrieval model: TF-IDF.
The top 1,000 results are regarded as the relevant text.
Document-based indexing gives better recall (70% compared with 60%).
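A minimal TF-IDF retrieval step over a document-level index could be sketched as follows. The exact TF-IDF variant used in the experiments is not given in the slides, so the smoothed IDF formula, the function names, and the toy documents here are assumptions.

```python
import math
from collections import Counter


def build_index(docs):
    """docs: list of token lists. Returns per-document term counts and IDF."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    tfs = [Counter(tokens) for tokens in docs]
    return tfs, idf


def retrieve(query, tfs, idf, top_k=1000):
    """Return ids of the top_k documents ranked by summed TF-IDF of query terms."""
    scored = []
    for doc_id, tf in enumerate(tfs):
        score = sum(tf[w] * idf.get(w, 0.0) for w in query)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy corpus; the query stands for translated context words.
docs = [
    ["verbiest", "visited", "beijing"],
    ["beijing", "news"],
    ["weather", "report"],
]
tfs, idf = build_index(docs)
ranked = retrieve(["verbiest", "beijing"], tfs, idf, top_k=1000)
```

For sentence-based indexing, the same functions would be applied with each sentence treated as its own document.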

Experiment Results

Test dataset: NIST 2002 Machine Translation Evaluation test data.
100 Chinese documents, 878 sentences, 25,430 words.
2,469 NEs are automatically tagged (PER: 20%, LOC: 60%, ORG: 20%).
Only PER and LOC are considered.
Among the 1,898 tagged PERs and LOCs, 338 are true NEs that are not covered by the LDC lexicon.

Baseline system: the CMU statistical MT system (Vogel et al., 2003).