data is core
andrejs vasiļjevs
chairman of the board
LOCALIZATION WORLD PARIS, JUNE 5, 2012
• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 135 employees
• Strong R&D team
• 9 PhDs and candidates
machine translation
disruptive INNOVATION
MT paradigms
rule-based MT
• High quality translation in specialized domains
• Requires highly qualified linguists, researchers and software developers
• Time- and resource-consuming
• Difficult to evolve
statistical MT
• Translation and linguistic knowledge is derived from data
• Relatively easy and quick to develop
• Requires huge amounts of parallel and monolingual data
• Translation quality is inconsistent and can differ dramatically from domain to domain
CHALLENGE
IT
Aerospace
Agriculture
Automotive
Chemistry
Coal and mining industries
Communications
Culture
Defence
Education
Electronics
Energy
Finance
Food technology
Government affairs
Legal
Life sciences
Logistics
Marketing
Mechanical engineering
Medicine
Pharmaceuticals
Religion
Social affairs
Trade
one size fits all?
Data for SMT training
• JRC-Acquis: the total body of European Union law applicable in the EU Member States
http://langtech.jrc.it/JRC-Acquis.html
• DGT-TM: the DGT Multilingual Translation Memory of the Acquis Communautaire
http://langtech.jrc.it/DGT-TM.html
• Opus: parallel data collected from the Web by University of Uppsala; 90 languages, 3800 language pairs, 2.7B parallel units
http://opus.lingfil.uu.se
• Open European language resource infrastructure
http://www.meta-net.eu
PLATFORM
[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz
% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT
% train-model.perl \
--corpus factored-corpus/proj-syndicate \
--root-dir unfactored \
--f de --e en \
--lm 0:3:factored-corpus/surface.lm:0
% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"
use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-scriptdir/ems/support/berkeley-train.sh
berkeley-process = $moses-scriptdir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 EMWordAligner.numThreads 8"
berkeley-process-options = "EMWordAligner.numThreads 8"
berkeley-posterior = 0.5
tokenize
in: raw-stem
out: tokenized-stem
default-name: corpus/tok
pass-unless: input-tokenizer output-tokenizer
template-if: input-tokenizer IN.$input-extension OUT.$input-extension
template-if: output-tokenizer IN.$output-extension OUT.$output-extension
parallelizable: yes
working-dir = /home/pkoehn/experiment
Moses toolkit
build your own MT engine
Tilde / Coordinator, LATVIA
University of Edinburgh, UK
Uppsala University, SWEDEN
Copenhagen University, DENMARK
University of Zagreb, CROATIA
Moravia, CZECH REPUBLIC
SemLab, NETHERLANDS
• Cloud-based self-service MT factory
• Repository of parallel and monolingual corpora for MT generation
• Automated training of SMT systems from specified collections of data
• Users can specify particular training data collections and build customised MT engines from these collections
• Users can also use the LetsMT! platform to tailor MT systems to their needs using their non-public data
Resource Repository
• Stores SMT training data
• Supports different formats – TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment
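As an illustration of the conversion step, here is a minimal Python sketch that reads a TMX translation memory and flattens it into a list of source/target sentence pairs. It is not the LetsMT! implementation: the platform's actual unified format is not described on the slides, so the pair list, the function name tmx_to_pairs and the sample sentences are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the standard XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample data, used only to exercise the sketch.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="lv"><seg>Atveriet failu.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Save changes?</seg></tuv>
      <tuv xml:lang="lv"><seg>Vai saglabāt izmaiņas?</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def tmx_to_pairs(tmx_text, src="en", tgt="lv"):
    """Return (source, target) pairs for every translation unit
    that has a segment in both requested languages."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


if __name__ == "__main__":
    for source, target in tmx_to_pairs(SAMPLE_TMX):
        print(source, "\t", target)
```

Units missing either language are skipped, which is one simple way to keep only usable parallel units; a real pipeline would also need deduplication and sentence alignment for non-TM formats.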
user-driven machine translation
• Put users in control of their data
• Fully public or fully private should not be the only choice
• Data can be used for MT generation without exposing it
• Empower users to create custom MT engines from their data
integration
• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration
Integration of MT in SDL Trados
[Architecture diagram: platform components]
• Stages: sharing of training data, training, using
• Access: upload, anonymous access, authenticated access
• Interfaces: web page, web service, CAT tools, web page translation widget, web browser plug-ins
• Components: SMT Resource Directory, SMT Resource Repository, SMT System Directory, SMT Multi-Model Repository (trained SMT models), Giza++, Moses SMT toolkit, Moses decoder
• Services: processing, evaluation, system management, user authentication, access rights control
use case
FORTERA
EVALUATION
Previous Work
• Keyboard-monitoring of post-editing (O'Brien, 2005)
• Productivity of MS Office localization (Schmidtke, 2008): 5-10% productivity gain for SP, FR, DE
• Adobe (Flournoy and Duran, 2009): 22%-51% productivity increase for RU, SP, FR
• Autodesk Moses SMT system (Plitt and Masselot, 2010): 74% average productivity increase for FR, IT, DE, SP
Evaluation at Tilde
• Latvian:
  - About 1.6 M native speakers
  - Highly inflectional: ~22M possible word forms in total
  - Official EU language
• Tilde English – Latvian MT system
• IT Software Localization Domain
• Evaluation of translators' productivity
English-Latvian data

Bilingual corpus                   Parallel units
Localization TM                    1 290 K
DGT-TM                             1 060 K
OPUS EMEA                            970 K
Fiction                              660 K
Dictionary data                      510 K
Web corpus                           900 K
Total                              5 370 K

Monolingual corpus                 Words
Latvian side of parallel corpus     60 M
News (web)                         250 M
Fiction                              9 M
Total, Latvian                     319 M
MT Integration into Localization Workflow
• Evaluate original / assign Translator and Editor
• Analyze against TMs
• MT translate new sentences
• Translate using translation suggestions from TMs and MT
• Evaluate translation quality / Edit
• Fix errors
• Ready translation
Evaluation of Productivity
• The key interest of the localization industry is to increase the productivity of the translation process while maintaining the required quality level
• Productivity was measured as the translation output of an average translator in words per hour
• 5 translators participated in the evaluation, including both experienced and new translators
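The words-per-hour measure and the percentage increases reported in this deck can be related with a short calculation. The Python sketch below is only an illustration of that arithmetic: the function names and the sample figures (500 words translated in 1 hour with TM only, and in 0.8 hours with TM and MT) are hypothetical and are not taken from the evaluation.

```python
def words_per_hour(words, hours):
    """Translation output in words per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return words / hours


def productivity_increase(baseline_wph, assisted_wph):
    """Relative change in productivity, in percent, of the
    TM+MT scenario compared with the TM-only baseline."""
    if baseline_wph <= 0:
        raise ValueError("baseline productivity must be positive")
    return (assisted_wph - baseline_wph) / baseline_wph * 100.0


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    tm_only = words_per_hour(500, 1.0)       # 500 words/hour
    tm_and_mt = words_per_hour(500, 0.8)     # 625 words/hour
    print(f"{productivity_increase(tm_only, tm_and_mt):.1f}% increase")
```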
Evaluation of Quality
• Performed by human editors as part of their regular QA process
• The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator
• Comparison to reference is not part of this evaluation
• Tilde standard QA assessment form was used, covering the following text quality areas:
  - Accuracy
  - Spelling and grammar
  - Style
  - Terminology
QA Grades

Error Score (sum of weighted errors)   Resulting Quality Evaluation
0…9                                    Superior
10…29                                  Good
30…49                                  Mediocre
50…69                                  Poor
>70                                    Very poor

Tilde Localization QA assessment applied in the evaluation
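The grade table can be read as a simple lookup from error score to quality grade. The Python sketch below shows one way to express it. Two points are assumptions of this example and not part of the Tilde form: the per-category weights are invented (the slides do not give them), and a score of exactly 70 or any fractional score between bands is resolved by treating each band as "below the next threshold".

```python
# Upper thresholds (exclusive) for each grade, following the table:
# 0..9 Superior, 10..29 Good, 30..49 Mediocre, 50..69 Poor.
GRADE_THRESHOLDS = [
    (10, "Superior"),
    (30, "Good"),
    (50, "Mediocre"),
    (70, "Poor"),
]

# Hypothetical weights per error category; the real weights of the
# QA assessment form are not stated in the slides.
EXAMPLE_WEIGHTS = {
    "accuracy": 3,
    "spelling_grammar": 2,
    "style": 1,
    "terminology": 2,
}


def error_score(error_counts, weights=EXAMPLE_WEIGHTS):
    """Sum of weighted errors over all categories."""
    return sum(weights[category] * count
               for category, count in error_counts.items())


def qa_grade(score):
    """Map an error score to a grade. Assumption: scores of 70 and
    above are 'Very poor' (the table lists 50..69 and >70)."""
    if score < 0:
        raise ValueError("error score cannot be negative")
    for upper, grade in GRADE_THRESHOLDS:
        if score < upper:
            return grade
    return "Very poor"


if __name__ == "__main__":
    counts = {"accuracy": 2, "style": 3}
    score = error_score(counts)
    print(score, qa_grade(score))
```

Under these assumptions, an error score rising from 19 to 27 stays in the same "Good" band, since both values are below 30.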
Evaluation data
► 54 documents in IT domain
► 950-1050 adjusted words in each document
► Each document was split in half:
  ► the first part was translated using suggestions from TM only
  ► the second half was translated using suggestions from both TM and MT
Latvian: 32.9%* productivity increase
* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, pp. 35-40, May 30-31, 2011, Leuven, Belgium
Evaluation at Moravia
► IT Localization domain
► Systems trained on the LetsMT platform
► English - Czech translation
  - 25.1% productivity increase
  - Error score increase from 19 to 27, still at the GOOD grade (<30)
► English – Polish translation
  - 28.5% productivity increase
  - Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)
Productivity increase: Slovak* 25%, Czech 25.1%, Polish 28.5%
* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, the productivity increase was estimated by Fortera.
MORE DATA
ACCURAT TOOLKIT
• corpora collection tools
• comparability metrics
• named entity recognition tools
• terminology extraction tools
use case
AUTOMOTIVE MANUFACTURER
• very small translation memories (just 3500 sentences)
• no in-domain corpora in target languages
• no money for expensive developments
data collection workflow
• Terminology extraction
• Web crawling (parallel, monolingual)
• Parallel data extraction from comparable corpora
Resulting data
• TMs
• Terminology glossary
• Parallel phrases
• Parallel Named Entities
• Monolingual target language corpus
SMT Training
• General domain data as a basis
• Domain-specific language model
• Impose domain-specific terminology, named entity translations
• Add linguistic knowledge on top of statistical components
right data & right tools
tilde.com
technologies for smaller languages
The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no. 250456