Transcript Document
data is core

andrejs vasiļjevs
chairman of the board
[email protected]

LOCALIZATION WORLD
PARIS, JUNE 5, 2012

• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 135 employees
• Strong R&D team
• 9 PhDs and candidates

machine translation

machine translation: disruptive INNOVATION

MT paradigms

rule-based MT
• High quality translation in specialized domains
• Requires highly qualified linguists, researchers and software developers
• Time and resource consuming
• Difficult to evolve

statistical MT
• Translation and linguistic knowledge is derived from data
• Relatively easy and quick to develop
• Requires huge amounts of parallel and monolingual data
• Translation quality is inconsistent and can differ dramatically from domain to domain

CHALLENGE

IT, Aerospace, Agriculture, Automotive, Chemistry, Coal and mining industries, Communications, Culture, Defence, Education, Electronics, Energy, Finance, Food technology, Government affairs, Legal, Life sciences, Logistics, Marketing, Mechanical engineering, Medicine, Pharmaceuticals, Religion, Social affairs, Trade

one size fits all?
Data for SMT training
• JRC-Acquis: the total body of European Union law applicable in the EU Member States
  http://langtech.jrc.it/JRC-Acquis.html
• DGT-TM: the DGT Multilingual Translation Memory of the Acquis Communautaire
  http://langtech.jrc.it/DGT-TM.html
• Opus: parallel data collected from the Web by University of Uppsala; 90 languages, 3800 language pairs, 2.7B parallel units
  http://opus.lingfil.uu.se
• Open European language resource infrastructure
  http://www.meta-net.eu

PLATFORM

Moses toolkit
build your own MT engine

    [ttable-file]
    0 0 5 /.../unfactored/model/phrase-table.0-0.gz

    % ls steps/1/LM_toy_tokenize.1* | cat
    steps/1/LM_toy_tokenize.1
    steps/1/LM_toy_tokenize.1.DONE
    steps/1/LM_toy_tokenize.1.INFO
    steps/1/LM_toy_tokenize.1.STDERR
    steps/1/LM_toy_tokenize.1.STDERR.digest
    steps/1/LM_toy_tokenize.1.STDOUT

    % train-model.perl \
        --corpus factored-corpus/proj-syndicate \
        --root-dir unfactored \
        --f de --e en \
        --lm 0:3:factored-corpus/surface.lm:0

    % moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

    use-berkeley = true
    alignment-symmetrization-method = berkeley
    berkeley-train = $moses-scriptdir/ems/support/berkeley-train.sh
    berkeley-process = $moses-scriptdir/ems/support/berkeley-process.sh
    berkeley-jar = /your/path/to/berkeleyaligner2.1/berkeleyaligner.jar
    berkeley-java-options = "-server -mx30000m -ea"
    berkeley-training-options = "-Main.iters 5 5 EMWordAligner.numThreads 8"
    berkeley-process-options = "EMWordAligner.numThreads 8"
    berkeley-posterior = 0.5

    tokenize
        in: raw-stem
        out: tokenized-stem
        default-name: corpus/tok
        pass-unless: input-tokenizer output-tokenizer
        template-if: input-tokenizer IN.$input-extension OUT.$input-extension
        template-if: output-tokenizer IN.$output-extension OUT.$output-extension
        parallelizable: yes

    working-dir = /home/pkoehn/experiment

• Tilde / Coordinator (LATVIA)
• University of Edinburgh (UK)
• Uppsala University (SWEDEN)
• Copenhagen University (DENMARK)
• University of Zagreb (CROATIA)
• Moravia (CZECH REPUBLIC)
• SemLab (NETHERLANDS)

• Cloud-based self-service MT factory
• Repository of parallel and monolingual corpora for MT generation
• Automated training of SMT systems from specified collections of data
• Users can specify particular training data collections and build customised MT engines from these collections
• Users can also use the LetsMT! platform to tailor an MT system to their needs using their non-public data

Resource Repository
• Stores SMT training data
• Supports different formats – TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format
• Performs format conversions and alignment

user-driven machine translation
• Put users in control of their data
• Fully public or fully private should not be the only choice
• Data can be used for MT generation without exposing it
• Empower users to create custom MT engines from their data

integration
• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration

Integration of MT in SDL Trados

[Architecture diagram: Sharing of training data | Training | Using | SMT Resource Directory | Giza++ | Moses SMT toolkit | SMT Multi-Model Repository (trained SMT models) | Anonymous access | SMT Resource Repository | Web page translation widget | Web browser plug-ins | SMT System Directory | Moses decoder | System management, user authentication, access rights control ... | Authenticated access | Processing, Evaluation ...]
[Architecture diagram, continued: Upload | Web page | Web service | CAT tools]

use case
FORTERA

EVALUATION

Previous Work
• Keyboard-monitoring of postediting (O'Brien, 2005)
• Productivity of MS Office localization (Schmidtke, 2008): 5-10% productivity gain for SP, FR, DE
• Adobe (Flournoy and Duran, 2009): 22%-51% productivity increase for RU, SP, FR
• Autodesk Moses SMT system (Plitt and Masselot, 2010): 74% average productivity increase for FR, IT, DE, SP

Evaluation at Tilde
• Latvian:
  About 1.6 M native speakers
  Highly inflectional, ~22M possible word forms in total
  Official EU language
• Tilde English – Latvian MT system
• IT Software Localization Domain
• Evaluation of translators' productivity

Bilingual corpus                 Parallel units
Localization TM                  1 290 K
DGT-TM                           1 060 K
OPUS EMEA                          970 K
Fiction                            660 K
Dictionary data                    510 K
Web corpus                         900 K
Total English-Latvian data       5 370 K

Monolingual corpus               Words
Latvian side of parallel corpus   60 M
News (web)                       250 M
Fiction                            9 M
Total, Latvian                   319 M

MT Integration into Localization Workflow
• Evaluate original / assign Translator and Editor
• Analyze against TMs
• MT translate new sentences
• Translate using translation suggestions from TMs and MT
• Evaluate translation quality / Edit
• Fix errors
• Ready translation

Evaluation of Productivity
• Key interest of the localization industry is to increase productivity of the translation process while maintaining the required quality level
• Productivity was measured as the translation output of an average translator in words per hour
• 5 translators participated in the evaluation, including both experienced and new translators

Evaluation of Quality
• Performed by human editors as part of their regular QA process
• The result of the translation process was evaluated; editors did not know whether MT had been applied to assist the translator
• Comparison to reference is not part of this evaluation
• Tilde standard QA assessment form was used, covering the following text quality areas:
  Accuracy
  Spelling and grammar
  Style
  Terminology

QA Grades: Error Score (sum of weighted errors)
Resulting Quality Evaluation (by error score)
0…9      Superior
10…29    Good
30…49    Mediocre
50…69    Poor
>70      Very poor
Tilde Localization QA assessment applied in the evaluation

Evaluation data
►54 documents in IT domain
►950-1050 adjusted words in each document
►Each document was split in half:
  ►the first part was translated using suggestions from TM only
  ►the second half was translated using suggestions from both TM and MT

Productivity increase, %
Latvian   32.9%*
* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium

Evaluation at Moravia
► IT Localization domain
► Systems trained on the LetsMT platform
► English – Czech translation
  25.1% productivity increase
  Error score increase from 19 to 27, still at the GOOD grade (<30)
► English – Polish translation
  28.5% productivity increase
  Error score increase from 16.8 to 23.6, still at the GOOD grade (<30)

Productivity increase, %
Slovak*   25%
Czech     25.1%
Polish    28.5%
* For Czech and Polish, formal evaluation was done by Moravia. For Slovak, productivity increase was estimated by Fortera.

MORE DATA

ACCURAT TOOLKIT
• corpora collection tools
• comparability metrics
• named entity recognition tools
• terminology extraction tools

use case
AUTOMOTIVE MANUFACTURER
• very small translation memories (just 3500 sentences)
• no in-domain corpora in target languages
• no money for expensive developments

data collection workflow
• Terminology extraction
• Web crawling: parallel, monolingual
• Parallel data extraction from comparable corpora

Resulting data
• TMs
• Terminology glossary
• Parallel phrases
• Parallel Named Entities
• Monolingual target language corpus

SMT Training
• General domain data as a basis
• Domain specific language model
• Impose domain specific terminology, named entity translations
• Add linguistic knowledge on top of statistical components

right data & right tools

tilde.com
technologies for smaller languages

The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456
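
The two measures used in the evaluation slides, productivity in words per hour and the QA grade derived from the error score, can be illustrated with a minimal Python sketch. It is not Tilde's or Moravia's tooling. The function names `qa_grade` and `productivity_increase` are hypothetical, and the grade bands are read as half-open ranges (below 10, below 30, below 50, below 70), which is an assumption: the slide lists whole-number bands and leaves a score of exactly 70 unassigned, while the Moravia slide describes the GOOD grade as "<30".

```python
"""Sketch of the two evaluation measures: QA grade and productivity increase."""

# (exclusive upper bound on the error score, grade), following the QA Grades scale.
# Assumption: bands are treated as half-open ranges so that fractional scores
# such as 16.8 receive a grade, and a score of exactly 70 (unassigned on the
# slide) falls into "Very poor".
GRADE_BANDS = [
    (10, "Superior"),
    (30, "Good"),
    (50, "Mediocre"),
    (70, "Poor"),
]


def qa_grade(error_score: float) -> str:
    """Map an error score (sum of weighted errors) to a QA grade."""
    if error_score < 0:
        raise ValueError("error score must be non-negative")
    for upper_bound, grade in GRADE_BANDS:
        if error_score < upper_bound:
            return grade
    return "Very poor"


def productivity_increase(baseline_wph: float, assisted_wph: float) -> float:
    """Relative change in words per hour, in percent, versus the TM-only baseline."""
    if baseline_wph <= 0:
        raise ValueError("baseline productivity must be positive")
    return (assisted_wph - baseline_wph) / baseline_wph * 100.0


if __name__ == "__main__":
    # Error scores reported for English-Czech: 19 before, 27 after adding MT.
    print(qa_grade(19), qa_grade(27))
    # Hypothetical throughput figures, not taken from the evaluation.
    print(productivity_increase(400.0, 500.0))
```

With the English – Czech and English – Polish error scores from the Moravia slide (19 and 27, 16.8 and 23.6), `qa_grade` returns "Good" in every case, which matches the slide's remark that quality stayed at the GOOD grade.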