Brief Project Overview


ACCURAT:
Metrics for the evaluation of comparability
of multilingual corpora
Andrejs Vasiljevs, Inguna Skadiņa (Tilde), Bogdan Babych, Serge
Sharoff (CTS), Ahmet Aker, Robert Gaizauskas, David Guthrie
(USFD)
LREC 2010 Workshop on Methods for the automatic acquisition of
Language Resources and their evaluation
May 23, 2010
Challenge of Data Driven MT
 Rapid development of data-driven methods for MT
 Automated acquisition of linguistic knowledge extracted from huge
parallel corpora provides an effective solution that minimizes
time- and resource-consuming manual work
 Applicability of current data-driven methods directly depends on
the availability of very large quantities of parallel corpus data
 Translation quality of current data-driven MT systems is very low
for under-resourced languages and domains
ACCURAT MISSION
To significantly improve MT quality
for under-resourced languages and
narrow domains
by researching how comparable corpora
can compensate for a shortage of
linguistic resources
Comparable Corpora
 Non-parallel bi- or multilingual text resources
 Collection of documents that are:
 gathered according to a set of criteria
e.g. proportion of texts of the same genre in the same
domains in the same period
 in two or more languages
 containing overlapping information
 Examples:
 multilingual news feeds
 multilingual websites
 Wikipedia articles
 etc.
Wikipedia example
Comparability scale
 parallel corpora:
• texts which are true and accurate translations;
• texts which are approximate translations;
 strongly comparable corpora:
• texts from the same source on the same topic with
the same editorial control;
• independently written texts on the same topic;
 weakly comparable corpora:
• texts in the same narrow subject domain and genre;
• texts within the same broader domain and genre, but
varying in subdomains and specific genres;
 non-comparable corpora:
• pairs of texts drawn at random from a pair of very
large collections of texts (e.g. the web) in the two
languages
Key questions for our research
How to measure comparability?
How to collect comparable
corpora?
How to extract linguistic data for
MT from comparable corpora?
How to get most out of the data
to improve SMT and RBMT?
How to evaluate the effect of our
methods?
Key objectives
 To create comparability metrics - to develop the
methodology and determine criteria to measure the
comparability of source and target language documents
in comparable corpora
 To develop, analyze and evaluate methods for
automatic acquisition of comparable corpora from the
Web
 To elaborate advanced techniques for extraction of
lexical, terminological and other linguistic data from
comparable corpora to provide training and
customization data for MT
 To measure improvements from applying acquired data
against baseline results from SMT and RBMT systems
 To evaluate and validate the ACCURAT project results in
practical applications
ACCURAT Languages
 Focus on under-resourced languages
Latvian, Lithuanian, Estonian, Greek, Croatian,
Romanian, Slovenian
 Major translation directions
e.g. English-Lithuanian, English-Croatian, German-Romanian
 Minor translation directions
e.g. Lithuanian-Romanian, Romanian-Greek and
Latvian-Lithuanian
 Methods will be adjustable to new languages and
domains, and language independent where possible
 Applicability of methods will be evaluated in usage
scenarios
Project Partners
 Tilde (Project Coordinator) - Latvia
 University of Sheffield - UK
 University of Leeds - UK
 Athena Research and Innovation Center in Information
Communication and Knowledge Technologies - Greece
 University of Zagreb - Croatia
 DFKI - Germany
 Institute of Artificial Intelligence - Romania
 Linguatec - Germany
 Zemanta - Slovenia
Objectives for comparability metrics
 To develop criteria and automated metrics to
determine the kind and degree of comparability of
comparable corpora and parallelism of documents and
individual sentences within documents
 To evaluate metrics designed for determining similar
documents in comparable corpora
 To develop a methodology to assess comparable
corpora collected from the Web and to choose an
alignment strategy and lexical data extraction
methods.
Criteria of Comparability and Parallelism
 Lack of definite methods to determine the criteria of
comparability
 Some attempts to measure the degree of comparability according
to distribution of topics and publication dates of documents in
comparable corpora to estimate the global comparability of the
corpora (Saralegi et al., 2008)
 Some attempts to determine different kinds of document
parallelism in comparable corpora, such as complete parallelism,
noisy parallelism and complete non-parallelism
 Some attempts to define criteria of parallelism of similar
documents in comparable corpora, such as similar number of
sentences, sharing sufficiently many links (up to 30%), and
monotony of links (up to 90% of links do not cross each other)
(Munteanu, 2006)
 Research on automated methods for assessing the composition of
web corpora in terms of domains and genres (Sharoff, 2007)
Towards establishing metrics
Metrics: intralingual and interlingual comparability for
genres, domains and topics
 intralingual: distance between corpora and documents
within corpora in the same language
 methods: distance in feature spaces and machine
learning
 interlingual: distance between corpora and
documents in different languages
 methods: dictionaries and existing MT to map feature
spaces between languages
Evaluation: validation of the scale by independent
annotation
Criteria of Comparability and Parallelism
 To investigate criteria for comparability
between corpora concentrating on different
sets of features:
 Lexical features: measuring the degree of 'lexical
overlap' between frequency lists derived from
corpora
 Lexical sequence features: computing N-gram
distances in terms of tokens
 Morpho-syntactic features: computing N-gram
distances in terms of Part-of-Speech codes
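As a minimal illustration of the first feature set, 'lexical overlap' between two top-N frequency lists can be sketched as a Jaccard overlap. This is a toy sketch, not project code; the function names and the exact definition are our assumptions:

```python
# A toy sketch (not project code): 'lexical overlap' between the top-N
# frequency lists of two corpora, measured as Jaccard overlap of the
# two sets of most frequent word types.
from collections import Counter

def top_n(tokens, n=500):
    """Set of the n most frequent word types in a token list."""
    return {w for w, _ in Counter(tokens).most_common(n)}

def lexical_overlap(tokens_a, tokens_b, n=500):
    """Jaccard overlap of the two top-n frequency lists (0..1)."""
    a, b = top_n(tokens_a, n), top_n(tokens_b, n)
    return len(a & b) / len(a | b)

corpus_a = "the cat sat on the mat the cat".split()
corpus_b = "the dog sat on the log the dog".split()
print(lexical_overlap(corpus_a, corpus_b, n=5))  # 3 shared of 7 types
```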
Features
 Initial set of features which may be used to identify
the comparability between documents

Language independent:
• Document length
• Date
• Character overlap
• Web features
  - URL of doc source
  - Common links
  - Links referring to each other
  - Image links
• Other features …

Language dependent (requires translation):
• Lexical overlap
• Document length
• Web features
  - Anchor text
  - Image alt tag
• Genre
• Domain
• Other features …
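Two of the language-independent features above can be sketched in a few lines. This is a toy illustration; the names and exact definitions are our assumptions, not the project's:

```python
# Toy versions of two language-independent features from the table:
# document length ratio and character n-gram overlap. Function names
# and exact definitions are our assumptions, not the project's.
def length_ratio(doc_a: str, doc_b: str) -> float:
    """Ratio of shorter to longer document length in tokens (0..1)."""
    la, lb = len(doc_a.split()), len(doc_b.split())
    return min(la, lb) / max(la, lb)

def char_ngram_overlap(doc_a: str, doc_b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-gram sets; catches names, numbers
    and cognates that survive across languages."""
    def grams(s):
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    a, b = grams(doc_a.lower()), grams(doc_b.lower())
    return len(a & b) / len(a | b) if a or b else 0.0
```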
Initial comparable corpora
 For the development of comparability metrics, an
Initial Comparable Corpora collection has been gathered
 11 M words, 9 languages
                     ET-EN  LV-EN  LT-EN  EL-EN  RO-EL  HR-EN  RO-EN  RO-DE  SL-EN
% parallel            9.48  11.82  46.17  13.33  32.62  39.51   6.94   8.52  40.17
strongly comparable  51.06  37.51  21.83  20.47  30.96   9.44  17.07  32.67  27.98
weakly comparable    39.46  50.67  32.00  66.20  36.42  51.05  76.00  58.81  31.85
Initial comparable corpora

Domain              Genre          Collected
International news  Newswires      14.73%
Sports              Newswires       8.23%
Admin               Legal          11.00%
Travel              Advice         14.46%
Software            Wikipedia       5.83%
Software            User manuals   22.11%
Medicine            For doctors    12.35%
Medicine            For patients   11.30%
Comparable Test Corpora
 Collected for evaluation of the comparability metrics
 34% parallel texts, 33% strongly comparable texts and
33% weakly comparable texts
 9 languages, 247 000 running words

Domain              Genre          Coverage (%)
International news  Newswires      13
Sports              Newswires      11
Admin               Legal          19
Travel              Advice          7
Software            Wikipedia       6
Software            User manuals   12
Medicine            For doctors    11
Medicine            For patients   21
Benchmarking comparability
 Problems with human labelling:
 Labels are too coarse-grained
 Symbolic labels pose problems for establishing correlation
with the numeric scores produced by the metric
 Labelling criteria and/or human annotation may be unsystematic
 Proposal: to benchmark the comparability metric against
the score of the resulting MT quality (e.g., the standard
BLEU/NIST scores)
Initial experiment: Cross-lingual comparison of corpora
 Mapping feature spaces using bilingual dictionaries or
translation probability tables
 Purpose: to see how much the frequencies of words
differ from the frequencies of their translations in
another language
 Set-up: 500 most frequent words; Relative
frequencies; Bilingual translation probability tables
(Europarl)
 χ-score (cross-lingual intersection)
 Pearson's correlation with the degree of comparability
Initial experiment
 Comparability of corpora is measured in terms of
lexical features (Greek—English and German—English
language pairs)
 The set-up is similar to (Kilgarriff, 2001):
 For each corpus take the top 500 most frequent words
 relative frequency is used (the absolute frequency, or
the word count, divided by the length of the corpus)
 Automatically generated dictionaries by Giza++ from
the parallel Europarl corpus
 We compare corpora pairwise using a standard Chi-Square
distance measure:

ChiSquare = Σ_{w1…w500} (FrqObserved − FrqExpected)² / FrqObserved
3rd BUCC, Malta, 22-05-10
Initial experiment
 Asymmetric method: relative frequencies in the corpus in
language A are treated as "expected" values, and those
mapped from the corpus in language B as "observed".
Then we swap corpora A and B and repeat the calculation.
The asymmetry comes from words which are missing in one
of the lists as compared to the other: missing words have
different relative frequencies that are added to the score,
so the distance from A to B can differ from the distance
from B to A.
We use the minimum of these distances as the final score
for the pair of corpora.
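The directed Chi-Square distance and its symmetric-min combination described above can be sketched as follows. This is a simplified illustration: the bilingual mapping step (dictionaries / translation probability tables) is omitted, so both frequency lists are assumed to already be in a shared, mapped vocabulary.

```python
# Sketch of the distance above, using the slide's formula (denominator
# FrqObserved). Simplification: the bilingual mapping step is omitted,
# i.e. both frequency lists are assumed to share a mapped vocabulary.
from collections import Counter

def rel_freqs(tokens, n=500):
    """Relative frequencies of the top-n word types."""
    total = len(tokens)
    return {w: c / total for w, c in Counter(tokens).most_common(n)}

def chi_square(expected, observed):
    """Directed distance: sum of (FrqObserved - FrqExpected)^2 / FrqObserved.
    Words missing from `expected` contribute with expected frequency 0,
    which is what makes the measure asymmetric."""
    return sum((fo - expected.get(w, 0.0)) ** 2 / fo
               for w, fo in observed.items())

def corpus_distance(tokens_a, tokens_b, n=500):
    """Compute both directions and keep the minimum as the final score."""
    fa, fb = rel_freqs(tokens_a, n), rel_freqs(tokens_b, n)
    return min(chi_square(fa, fb), chi_square(fb, fa))
```

Identical corpora score 0; corpora with fully disjoint top-n lists score the sum of the observed relative frequencies.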
Raw scores of Chi-Square across languages for Greek-English
Collecting comparable texts from the Web
 How should we collect comparable texts from the
Web?
 Crawling all documents on the Web and using
comparability metrics to align them:
 Inefficient
 High computational effort to align
 Instead: build a classifier/retrieval system that, given
a document (or certain characteristics), can find other
comparable documents
 Combine crawling with document pair classification or
searching
Classifier
 Using the initial comparable corpora for:
 feature extraction
 training a classifier/ranking system
 Predict whether pairs of documents are:
 parallel
 strongly comparable
 weakly comparable
 not comparable
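As an illustration only (the slides do not commit to a particular learning algorithm), such a classifier over document-pair feature vectors (f1..fn) could be as simple as a nearest-centroid model; everything below is a hypothetical sketch:

```python
# Illustration only: the slides do not fix a learning algorithm, so this
# shows a toy nearest-centroid classifier over document-pair feature
# vectors (f1..fn), predicting one of the four comparability levels.
from collections import defaultdict
import math

def train(pairs):
    """pairs: iterable of (feature_vector, level). Returns level -> centroid."""
    by_level = defaultdict(list)
    for vec, level in pairs:
        by_level[level].append(vec)
    return {lvl: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lvl, vecs in by_level.items()}

def predict(centroids, vec):
    """Assign the level whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda lvl: math.dist(centroids[lvl], vec))
```

In practice a discriminative classifier or a ranking model would replace the centroid step, but the train/predict interface over feature vectors is the same.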
General Idea
[Diagram: EN-EL document pairs from the Initial Comparable Corpora,
labelled parallel / strongly comparable / weakly comparable / not
comparable, go through feature extraction (f1, f2, f3, …, fn) to train
the classifier; for new document pairs, the classifier predicts the
comparability level.]
Evaluation
 Evaluation of comparability metrics against manually
annotated Comparable Test Corpora (precision, recall)
 Evaluation of document level alignment methods
against manually annotated Comparable Test Corpora
(precision, recall)
 Evaluation of sentence and phrase level alignment for
corpora with different levels of comparability – aligned
data are needed; one idea is to artificially degrade
("spoil") a parallel corpus
 Automated evaluation of applicability in MT against
baseline systems (BLEU, NIST)
 User evaluation in gisting and post editing scenarios
ACCURAT project has received funding from the
EU 7th Framework Programme for Research and Technological
Development under
Grant Agreement N°248347
Project duration: January 2010 – June 2012
Contact information:
Andrejs Vasiljevs
andrejs@tilde.lv
Tilde, Vienibas gatve 75a, Riga
LV1004, Latvia
www.accurat-project.eu