Tapta4IPC: helping translation of IPC definitions
Translation assistant for patent titles and abstracts in PATENTSCOPE; potential use in translating IPC definitions.
Bruno Pouliquen ([email protected])
25 Feb 2013, IPC workshop

Introduction
Statistical Machine Translation: bottom-up approach
• No rules, no grammar, no dictionary, no terminology, only the parallel texts (bitexts)
• We use an open-source system: Moses
Tapta: Translation of Patent Titles and Abstracts
• Originally built to translate patent applications
• Adapted to various applications

Tapta framework
[Diagram labels: system, data, source language, target language, gather/convert data, bitexts; processing steps: clean, post-filter, re-clean, train-model, prune, binarize, optimize, publish]
Our system prepares the data for Moses, applies some post-processing (filter, pruning, binarization, optimization…) and offers a Web interface for translation.

Introduction: Tapta
• In WIPO, as part of Patentscope (English, French, German, Chinese, Japanese)
  e.g. http://patentscope.wipo.int/translate/simpleTranslate.jsf?id=JP75694586&langpair=jaen
  Automatic translation of a patent application only available in Japanese…
• In United Nations (English from/into Arabic, French, Spanish, Russian & Chinese)

Technical workflow
[Diagram labels: source language, target language, filter wrong language, sentence-split, tokenization, sentence-align, score alignment, filter align., bitexts, translation client, translation server, Moses decoder (three instances)]
[Technical workflow diagram labels, continued: bitexts aligned at sentence level, Moses' training, phrase table, reordering model, language model]
Example of bitexts aligned at sentence level (En / Es):
  En: Strengthening of forum for human dignity : legal aid
  Es: Fortalecimiento del foro para la dignidad humana – asistencia jurídica
  En: must respect all aspects of human dignity
  Es: debe respetar todos los aspectos de la dignidad humana
  En: should fully respect human dignity
  Es: se deben respetar plenamente la dignidad humana

IPC context
• Gather data:
  – Get existing definitions
  – Add IPC schema (xml on WIPO website)
  – Add “few” texts from patents
• “Learn” translation model
• Translate new texts

Get existing data, build parallel texts
Existing definitions…
IPC schema…
  <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
    <textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>
  </ipcEntry>
  <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
    <textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>
  </ipcEntry>
Patent texts…
  WO/2013/014517
  (EN) TYRE FOR VEHICLE WHEELS
  (FR) PNEUMATIQUE POUR ROUES DE VÉHICULE
Bitext: training material…
  Wheels | roues
  Wheel guards | Couvre-roues
  Tyre for vehicle wheels | Pneumatique pour roues de véhicule

How well does it work?
• Automatic evaluation: BLEU score
• Principle: similarity of n-grams between evaluated and reference sentences
• On IPC definitions, English-French: BLEU = 48% (without patent data: 44%)
• Good quality needs human post-editing

Tapta4IPC prototype (1)
Live demo using: http://patentscope.wipo.int/translateUN/translateIPC.jsf

Tapta4IPC prototype (2)
http://fulty3.wipo.int:8080/Wtapta/translateIPC.jsf

Conclusion / future work
• This is a prototype, but the quality already looks acceptable
• Human evaluation?
• Better integrate the tool (in PCA6TRANSDEF?)
• Other languages?
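The "IPC schema" step on the parallel-texts slide (pairing the EN and FR title of the same symbol to obtain one bitext line) can be sketched as follows. This is a minimal illustration, not the Tapta code: the `<ipcEntries>` wrapper element and the function names `titles_by_symbol` and `build_bitext` are assumptions, and the real scheme file may nest entries or use a different layout, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# The two entries shown on the slide; the <ipcEntries> wrapper is an assumption
# added only so that the fragment is a single well-formed document.
SAMPLE = """<ipcEntries>
  <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
    <textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>
  </ipcEntry>
  <ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
    <textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>
  </ipcEntry>
</ipcEntries>"""


def titles_by_symbol(root, lang):
    """Map IPC symbol -> title text for every entry in the given language."""
    titles = {}
    for entry in root.iter("ipcEntry"):
        if entry.get("lang") != lang:
            continue
        # Join all <text> parts of the entry (flat entries only in this sketch).
        parts = [t.text.strip() for t in entry.iter("text") if t.text]
        if parts:
            titles[entry.get("symbol")] = "; ".join(parts)
    return titles


def build_bitext(xml_string, src="EN", tgt="FR"):
    """Return (source, target) title pairs for symbols present in both languages."""
    root = ET.fromstring(xml_string)
    src_titles = titles_by_symbol(root, src)
    tgt_titles = titles_by_symbol(root, tgt)
    return [(src_titles[s], tgt_titles[s])
            for s in sorted(src_titles) if s in tgt_titles]


if __name__ == "__main__":
    for en, fr in build_bitext(SAMPLE):
        print(f"{en}\t{fr}")
```

Symbols that exist in only one language are skipped, so every emitted pair is usable as a sentence-aligned training line.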
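The BLEU principle quoted on the evaluation slide (similarity of n-grams between the evaluated and the reference sentence) can be illustrated with a simplified sketch. It is a sentence-level variant with a single reference, whitespace tokenization and no smoothing; it is not the script used to obtain the 48% and 44% figures, and the names `ngrams` and `simple_bleu` are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference, in [0, 1]."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram is credited at most as often as it
        # occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "pneumatique pour roues de véhicule"
    print(simple_bleu("pneumatique pour roues de véhicule", reference))  # 1.0
    print(simple_bleu("pneumatique pour roues de véhicule automobile", reference))
```

Corpus-level BLEU, as normally reported, aggregates the n-gram counts over all sentences before taking the precisions, so averaging this sentence-level score over a test set would not reproduce a reported corpus score exactly.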
Tapta4IPC in various languages
Tapta4IPC should work reasonably well on the following languages (we have built some language specific tools and we have patent corpora):
• German
• Japanese
• Korean
• Spanish
• Dutch
• Portuguese
• Chinese
• Russian
More challenging:
• Czech, Slovak, Polish (many word forms, training corpus?)
• Estonian (even more word forms, would in theory require more training corpus)
Other languages: Arabic, Italian, Danish, Swedish etc.

Thank you for your attention
شكرا لكم على اهتمامكم
Merci pour votre attention!
感谢您的关注
Grazie per la vostra attenzione!
¡ Gracias por su atención !
Vielen Dank für Ihre Aufmerksamkeit!
Obrigado pela vossa atenção!
Dziękuję bardzo za Państwa uwagę!
Děkujeme za Vaši pozornost!
Ďakujem ti veľmi pekne za tvoju pozornosť
Tänan tähelepanu eest!
Благодарим за Вашето внимание!
Tak for Jeres opmærksomhed!