Analysis and Evaluation of Comparable
Corpora for Under Resourced Areas of
Machine Translation
Inguna Skadiņa, Andrejs Vasiļjevs,
Raivis Skadiņš, Robert Gaizauskas,
Dan Tufiş and Tatiana Gornostay
Challenge of Data Driven MT
- Rapid development of data-driven methods for MT
- Automated acquisition of linguistic knowledge extracted from huge parallel corpora provides an effective solution that minimizes time- and resource-consuming manual work
3rd BUCC
Malta
22-05-10
- Applicability of current data-driven methods directly depends on the availability of very large quantities of parallel corpus data
- Translation quality of current data-driven MT systems is very low for under-resourced languages and domains
Problem of availability of linguistic resources
• Relevant for “smaller” or under-resourced languages
• Example of Latvian:
- few parallel corpora of reasonable size (e.g., JRC Acquis, EMEA)
- SMT trained on these corpora performs well on in-domain documents, but gives unacceptable results for other domains (en-lv: 43.4 BLEU in domain, 10.2 BLEU out of domain)
• Solution: comparable corpora are much more widely available than parallel translation data
Accurat project
• The Accurat mission is to significantly improve MT quality
- for under-resourced languages and narrow domains
- by researching novel ways in which comparable corpora can compensate for a shortage of linguistic resources
• ACCURAT methods will be:
- Adjustable to new languages and domains
- Language independent where possible
• 2.5-year project, started on January 1, 2010
Key objectives
• To create comparability metrics: to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora
• To develop, analyze and evaluate methods for automatic acquisition of comparable corpora from the Web
• To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT
• To measure improvements from applying acquired data against baseline results from SMT and RBMT systems
• To evaluate and validate the ACCURAT project results in practical applications
Use Cases
• Adjusting MT to narrow domain
- Automotive engineering, assistive technology and data processing domains
• Application for Web authoring
- Blog and social networking (Zemanta application)
• Using SMT in software localization
- Increasing efficiency in localization, integration with CAT tools
Language Coverage
• Focus on under-resourced languages: Latvian, Lithuanian, Estonian, Greek, Croatian, Romanian and Slovenian
• Major translation directions like English-Lithuanian, English-Croatian, German-Romanian
• Minor translation directions like Lithuanian-Romanian, Romanian-Greek and Latvian-Lithuanian
Work Plan
• WP1: To create comparability metrics, i.e. to develop the methodology and determine criteria to measure the comparability of source and target language documents in comparable corpora (M3-M24)
• WP2: To elaborate advanced techniques for extraction of lexical, terminological and other linguistic data from comparable corpora to provide training and customization data for MT (M3-M23)
• WP3: To develop, analyze and evaluate methods for automatic acquisition of a comparable corpus from the Web (M1-M22)
• WP4: To measure improvements from applying acquired data against results from baseline SMT and RBMT systems (M7-M26)
Work Plan (continued)
• WP5: To evaluate and validate the ACCURAT project results in three practical applications (M7-M30)
• WP6: To disseminate project results and to transfer the project knowledge, technologies, lessons learned and best practices to interested communities, and thus to ensure their worldwide impact and long-term sustainability (M1-M30)
• WP7: To coordinate the project and provide administrative and financial management (M1-M30)
Milestones
• Initial comparable corpora (M3)
• Initial comparability metrics (M6)
• Application of existing alignment methods (M6)
• Baseline SMT systems (M9)
• Alignment and extraction methods for comparable corpora (M20)
• Tools for collecting comparable corpora from the Web (M22)
• Multilingual comparable corpora (M22)
• Criteria and metrics of comparability and parallelism (M24)
• Improved MT systems (M26)
• Adjusted MT systems in applications (M30)
Key Results
• Comparability metrics developed and tools provided
• Comparable corpora for under-resourced languages collected and tools provided
• Methods and tools for multilevel alignment from comparable corpora developed
• Methods for using comparable corpora in both SMT and RBMT developed
• Proven application scenarios prepared
Strong increase in MT quality for under-resourced languages and narrow domains
Initial comparable corpora (ICC)
Domain             | Genre        | Percent
International news | Newswires    | 20%
Sports             | Newswires    | 10%
Admin              | Legal        | 10%
Travel             | Advice       | 10%
Software           | Wikipedia    | 15%
Software           | User manuals | 15%
Medicine           | For doctors  | 10%
Medicine           | For patients | 10%
• 1 million tokens for each under-resourced language
• domain corpus for en-de
Recommended proportions
• parallel – 10%
• strongly comparable (heavily edited translations, or independent but closely related texts reporting the same event or describing the same subject) – 40%
• weakly comparable (e.g., texts within the same broader domain and genre but varying in subdomains and specific genres, or texts in the same narrow subject domain and genre but describing different events) – 50%
• length of each document should be between 500 and 3000 words
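A minimal sketch of how a collected corpus could be checked against these recommendations. The record layout (a list of `(level, word_count)` pairs) and the decision to measure shares in words rather than in documents are assumptions made for illustration; the slides do not say how the percentages are counted.

```python
from collections import Counter

# Recommended shares and document length range, as given on the slide.
RECOMMENDED = {"parallel": 0.10, "strongly comparable": 0.40, "weakly comparable": 0.50}
MIN_WORDS, MAX_WORDS = 500, 3000


def composition(docs):
    """Return the share of words per comparability level.

    `docs` is a list of (level, word_count) pairs (hypothetical layout).
    """
    totals = Counter()
    for level, words in docs:
        totals[level] += words
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {level: 0.0 for level in RECOMMENDED}
    return {level: totals[level] / grand_total for level in RECOMMENDED}


def length_outliers(docs):
    """Return the documents shorter than 500 or longer than 3000 words."""
    return [doc for doc in docs if not MIN_WORDS <= doc[1] <= MAX_WORDS]
```

Comparing the result of `composition` with `RECOMMENDED` shows which comparability level is over- or under-represented, and `length_outliers` lists the documents that would need to be split or dropped.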
Initial comparable corpora: results
Domain             | Genre        | Planned | Collected
International news | Newswires    | 20%     | 14.73%
Sports             | Newswires    | 10%     | 8.23%
Admin              | Legal        | 10%     | 11%
Travel             | Advice       | 10%     | 14.46%
Software           | Wikipedia    | 15%     | 5.83%
Software           | User manuals | 15%     | 22.11%
Medicine           | For doctors  | 10%     | 12.35%
Medicine           | For patients | 10%     | 11.30%
Initial comparable corpora: results (share of each comparability level per language pair, %)
                    | ET-EN | LV-EN | LT-EN | EL-EN | RO-EL | HR-EN | RO-EN | RO-DE | SL-EN
parallel            |  9.48 | 11.82 | 46.17 | 13.33 | 32.62 | 39.51 |  6.94 |  8.52 | 40.17
strongly comparable | 51.06 | 37.51 | 21.83 | 20.47 | 30.96 |  9.44 | 17.07 | 32.67 | 27.98
weakly comparable   | 39.46 | 50.67 | 32.00 | 66.20 | 36.42 | 51.05 | 76.00 | 58.81 | 31.85
Metadata
• Language
• Domain
• Genre
• Source
• Number of words
• IPR status
• Comparability level
Parallel and strongly comparable texts are also aligned at the document level
CES (Corpus Encoding Standards)
Extension to CES-CCES
CES Alignment–Extension to CCES Alignment
Criteria of Comparability and Parallelism
• Lack of definite methods to determine the criteria of comparability
• Some attempts to measure the degree of comparability according to the distribution of topics and publication dates of documents in comparable corpora, in order to estimate the global comparability of the corpora (Saralegi et al., 2008)
• Some attempts to determine different kinds of document parallelism in comparable corpora, such as complete parallelism, noisy parallelism and complete non-parallelism
• Some attempts to define criteria of parallelism of similar documents in comparable corpora, such as a similar number of sentences, sharing sufficiently many links (up to 30%), and monotonicity of links (up to 90% of links do not cross each other) (Munteanu, 2006)
Criteria of Comparability and Parallelism
• To investigate criteria for comparability between corpora, concentrating on different sets of features:
- Lexical features: measuring the degree of 'lexical overlap' between frequency lists derived from corpora
- Lexical sequence features: computing N-gram distances in terms of tokens
- Morpho-syntactic features: computing N-gram distances in terms of Part-of-Speech codes
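The slides do not define the N-gram distance, so the following is only one possible reading: each text is reduced to a bag of N-grams and the distance is 1 minus the weighted Jaccard overlap of the two bags. The same function works on tokens and on Part-of-Speech codes, provided both sequences use a shared symbol set (for example a common tagset); that requirement, the function names and the choice of weighted Jaccard are all assumptions of this sketch.

```python
from collections import Counter


def ngram_profile(sequence, n):
    """Count the n-grams of a sequence of tokens or Part-of-Speech codes."""
    return Counter(tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1))


def ngram_distance(seq_a, seq_b, n=2):
    """Return a distance in [0, 1]: 1 minus the weighted Jaccard overlap.

    Counter '&' keeps the minimum count per n-gram and '|' the maximum,
    so the ratio below is the weighted Jaccard similarity.
    """
    profile_a, profile_b = ngram_profile(seq_a, n), ngram_profile(seq_b, n)
    union = sum((profile_a | profile_b).values())
    if union == 0:
        return 0.0  # neither sequence is long enough to contain an n-gram
    return 1.0 - sum((profile_a & profile_b).values()) / union
```

For example, `ngram_distance(["DT", "NN", "VB"], ["DT", "NN", "IN"])` compares two tag sequences that share one of their three distinct bigrams.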
First experiment
• Comparability of corpora is measured in terms of lexical features (Greek–English and German–English language pairs)
• The set-up is similar to (Kilgarriff, 2001):
- For each corpus, take the 500 most frequent words
- Relative frequency is used (the absolute frequency, or the word count, divided by the length of the corpus)
- Dictionaries are generated automatically by Giza++ from the parallel Europarl corpus
• We compare corpora pairwise using a standard Chi-Square distance measure:
ChiSquare = \sum_{w \in \{w_1, \ldots, w_{500}\}} \frac{(Frq_{Observed}(w) - Frq_{Expected}(w))^2}{Frq_{Observed}(w)}
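The set-up above can be sketched in a few lines. The code follows the formula as it appears on the slide, which divides by the observed frequency (the textbook chi-square statistic divides by the expected one). Summing over the words of the observed list and treating a word missing from the expected list as having frequency 0 are assumptions of this sketch, as are the function names.

```python
from collections import Counter


def top_relative_frequencies(tokens, n=500):
    """Return {word: relative frequency} for the n most frequent words."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.most_common(n)}


def chi_square(observed, expected):
    """Sum (observed - expected)^2 / observed over the observed words.

    Both arguments map words to relative frequencies and must already be
    in the same language (e.g. after dictionary mapping).
    """
    return sum((f_obs - expected.get(word, 0.0)) ** 2 / f_obs
               for word, f_obs in observed.items())
```

With this definition a word that occurs only in the observed list adds its own relative frequency to the score, and two identical frequency lists have distance 0.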
First experiment
• Asymmetric method: relative frequencies in the corpus in language A are treated as “expected” values, and those mapped from the corpus in language B as “observed”. Then we swap corpora A and B and repeat the calculation. Asymmetry comes from words that are missing in one of the lists as compared to the other. Missing words have different relative frequencies that are added to the score, so the distance from A to B can differ from the distance from B to A. We use the minimum of these distances as the final score for the pair of corpora.
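A sketch of the asymmetric procedure, under stated assumptions: frequency lists are dictionaries from word to relative frequency, the bilingual dictionaries are plain one-to-one word mappings (the real Giza++ dictionaries are richer), words without a dictionary entry are dropped, and each directed score sums (observed - expected)^2 / observed over the observed words. All names are hypothetical.

```python
def map_frequencies(freqs, dictionary):
    """Translate a frequency list word by word, dropping unknown words."""
    mapped = {}
    for word, freq in freqs.items():
        target = dictionary.get(word)
        if target is not None:
            mapped[target] = mapped.get(target, 0.0) + freq
    return mapped


def directed_distance(expected, observed):
    """Score the observed list against the expected (reference) list.

    A word missing from `expected` contributes its full observed frequency.
    """
    return sum((f_obs - expected.get(word, 0.0)) ** 2 / f_obs
               for word, f_obs in observed.items())


def corpus_distance(freqs_a, freqs_b, dict_a_to_b, dict_b_to_a):
    """Compute both directed scores and keep the smaller one."""
    a_expected = directed_distance(freqs_a, map_frequencies(freqs_b, dict_b_to_a))
    b_expected = directed_distance(freqs_b, map_frequencies(freqs_a, dict_a_to_b))
    return min(a_expected, b_expected)
```

Because each direction sums over a different observed list, words missing on one side enter the two scores with different weights, which is where the asymmetry described above comes from.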
Features
• To extract the features which may be used to identify the comparability between documents
Language Independent
• Document length
• Date
• Character overlap
• Web features
- URL of doc source
- Common links
- Links referring to each other
- Image links
• Other features …
Language Dependent (requires translation)
• Lexical overlap
• Web features
- Anchor text
- Image alt tag
• Genre (?)
• Domain (?)
• Other features …
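As an illustration, three of the language-independent features can be computed as follows. The slides only name the features, so the concrete definitions (length ratio in words, Jaccard overlap of character sets, count of shared links) and the document layout (a dict with `"text"` and `"links"` keys) are assumptions of this sketch.

```python
def length_ratio(text_a, text_b):
    """Ratio of the shorter to the longer document length in words (0..1)."""
    len_a, len_b = len(text_a.split()), len(text_b.split())
    if max(len_a, len_b) == 0:
        return 1.0
    return min(len_a, len_b) / max(len_a, len_b)


def character_overlap(text_a, text_b):
    """Jaccard overlap of the sets of characters used in the two documents."""
    chars_a, chars_b = set(text_a), set(text_b)
    if not chars_a and not chars_b:
        return 1.0
    return len(chars_a & chars_b) / len(chars_a | chars_b)


def common_links(links_a, links_b):
    """Number of distinct outgoing links that both pages share."""
    return len(set(links_a) & set(links_b))


def feature_vector(doc_a, doc_b):
    """Combine the features of a document pair into one list."""
    return [
        length_ratio(doc_a["text"], doc_b["text"]),
        character_overlap(doc_a["text"], doc_b["text"]),
        common_links(doc_a["links"], doc_b["links"]),
    ]
```

None of these functions needs a dictionary or a translation step, which is what makes the features language independent.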
General Idea
[Figure: EN-EL document pairs from the Initial Comparable Corpora, labelled parallel, strongly comparable, weakly comparable or not comparable, go through features extraction (f1, f2, f3, …, fn); the features and the known Comparability Level train a Classifier, which outputs the Predicted Comparability Level for New Documents (EN, EL).]
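The training and prediction loop in the figure can be sketched with a very small classifier. The slide does not say which classifier is used, so a nearest-centroid classifier stands in here purely for illustration; the feature vectors are assumed to be lists of numbers of equal length, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(examples):
    """Average the feature vectors of each comparability level.

    `examples` is a list of (feature_vector, level) pairs taken from
    document pairs whose comparability level is already known.
    """
    sums = {}
    counts = defaultdict(int)
    for features, level in examples:
        if level not in sums:
            sums[level] = [0.0] * len(features)
        sums[level] = [s + f for s, f in zip(sums[level], features)]
        counts[level] += 1
    return {level: [s / counts[level] for s in total]
            for level, total in sums.items()}


def predict(centroids, features):
    """Return the level whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda level: math.dist(centroids[level], features))
```

Training uses the labelled pairs of the initial comparable corpora; `predict` is then applied to the feature vector of each new document pair.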
Metrics of Comparability and Parallelism
• Using defined criteria for parallelism, we would like to develop formal automated metrics for determining the degree of comparability
• Lack of comparability metrics to evaluate corpus usability for different tasks, such as machine translation, information extraction, cross-language information retrieval
• Recent studies (Kilgarriff, 2001; Rayson and Garside, 2000) have added a quantitative dimension to the issue of comparability by studying objective measures for detecting how similar (or different) two corpora are in terms of their lexical content
• Further studies (Sharoff, 2007) investigated automatic ways for assessing the composition of web corpora in terms of domains and genres
Sentence Alignment on Parallel Texts
• Danielsson, Pernilla and Ridings, Daniel. Practical presentation of a vanilla aligner. Göteborgs universitet, 1997
• Melamed, Dan. A Geometric Approach to Mapping Bitext Correspondence. University of Pennsylvania, 1996
• Chen, Stanley F. Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (Columbus, Ohio, 1993), Association for Computational Linguistics, Morristown, NJ, USA, 9-16
• State of the Art: Moore, Robert C. Fast and Accurate Sentence Alignment of Bilingual Corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas (Tiburon, California, 2002), Springer-Verlag, Heidelberg, 135-144:
- provisional alignment based on sentence lengths
- IBM Model 1: estimate the Translation Equivalents (TE) table
- generate one-to-one links based on sentence lengths and the TE table
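The middle step, estimating a translation equivalents table with IBM Model 1, can be sketched as a short EM loop. This is a textbook simplification (no NULL word, uniform initialisation, a fixed number of iterations) and not Moore's implementation; the function name and the `(target_word, source_word)` key layout are assumptions of this sketch.

```python
from collections import defaultdict


def ibm_model1(sentence_pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM, IBM Model 1 style.

    `sentence_pairs` is a list of (source_tokens, target_tokens) pairs.
    Returns a dict keyed by (target_word, source_word).
    """
    target_vocab = {tw for _, tgt in sentence_pairs for tw in tgt}
    # Uniform initialisation over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in sentence_pairs:
            for tw in tgt:
                # E-step: share each target word among the source words.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    delta = t[(tw, sw)] / norm
                    count[(tw, sw)] += delta
                    total[sw] += delta
        # M-step: renormalise the expected counts per source word.
        t = defaultdict(float, {pair: c / total[pair[1]]
                                for pair, c in count.items()})
    return t
```

On a toy corpus such as `[("das haus".split(), "the house".split()), ("das buch".split(), "the book".split()), ("ein buch".split(), "a book".split())]`, the probability mass for `haus` moves towards `house` over the iterations, because `the` is better explained by `das`.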
Our Sentence Alignment on Parallel Texts
• Reification: a link in the alignment is treated as a context-independent structured object.
• Using SVM (libsvm solution).
• Features:
- translation equivalence
- word length correlation (Pearson)
- special characters occurrence similarity
- word frequency ranks correlation
• Crossed links are allowed
Scenarios for Aligning Comparable Corpora
• Based on previous experience, the literature and current constraints (time, manpower, computational resources), we envisaged 3 possible ways of tackling the alignment of comparable corpora in order to get useful results:
- QA techniques
- Clustering
- Windowing
Accurat partners
• Tilde (Coordinator), Latvia
• University of Sheffield, UK
• University of Leeds, UK
• Athena Research and Innovation Center in Information Communication and Knowledge Technologies (ILSP), Greece
• University of Zagreb, Faculty of Humanities and Social Sciences, Croatia
• DFKI, Germany
• Institute of Artificial Intelligence, Romania
• Linguatec, Germany
• Zemanta, Slovenia
The ACCURAT project has received funding from the EU 7th Framework Programme for Research and Technological Development under Grant Agreement N° 248347
Project duration: January 2010 – June 2012
Contact information:
Andrejs Vasiljevs
andrejs tilde.lv
Tilde, Vienibas gatve 75a, Riga
LV1004, Latvia
www.accurat-project.eu