Tapta4IPC: helping translation of IPC definitions
Translation assistant for patent titles and abstracts in PATENTSCOPE: potential use in translating IPC definitions (collaboration)
Bruno Pouliquen ([email protected])
25 Feb 2013, IPC workshop
Introduction
Statistical Machine Translation: bottom-up approach
no rules, no grammar, no dictionary, no terminology,
only the parallel texts (bitexts)
We use an open-source system: Moses
Tapta: Translation of Patent Titles and Abstracts
• Originally built to translate patent applications
• Adapted to various applications
Tapta framework
Our system prepares the data for Moses, applies some post-processing (filtering, pruning, binarization, optimization…) and offers a Web interface for translation.
Data flow: source-language and target-language data → Gather/convert → Bitexts
Processing steps: clean → post-filter → re-clean → train-model → prune → binarize → optimize → Publish
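A minimal sketch of what a "clean" step on bitexts could look like, assuming the usual heuristics of dropping empty, overly long or badly length-mismatched pairs. The function name clean_bitext, both thresholds and the last (made-up) sentence pair are illustrative assumptions, not taken from Tapta.

```python
def clean_bitext(pairs, max_tokens=80, max_ratio=3.0):
    """Keep only sentence pairs that look usable as training material.

    max_tokens and max_ratio are illustrative defaults, not Tapta's settings.
    """
    kept = []
    for src, tgt in pairs:
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        # Drop pairs where either side is empty.
        if not src_tokens or not tgt_tokens:
            continue
        # Drop pairs where either side is too long.
        if len(src_tokens) > max_tokens or len(tgt_tokens) > max_tokens:
            continue
        # Drop pairs whose lengths differ too much: likely misaligned.
        longer = max(len(src_tokens), len(tgt_tokens))
        shorter = min(len(src_tokens), len(tgt_tokens))
        if longer / shorter > max_ratio:
            continue
        # Normalise whitespace in the pairs that are kept.
        kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


pairs = [
    ("Wheel guards", "Couvre-roues"),
    ("Tyre for vehicle wheels", "Pneumatique pour roues de véhicule"),
    ("Wheels", ""),  # empty target side
    ("Wheels", "roues de véhicule pour le transport ferroviaire"),  # made-up mismatch
]
cleaned = clean_bitext(pairs)
```

With these assumed thresholds, only the first two pairs survive; the empty and the length-mismatched pairs are filtered out.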
Introduction: Tapta
At WIPO, as part of Patentscope (English, French, German, Chinese, Japanese)
e.g. http://patentscope.wipo.int/translate/simpleTranslate.jsf?id=JP75694586&langpair=jaen
Automatic translation of a patent application only available in Japanese…
At the United Nations (English from/into Arabic, French, Spanish, Russian and Chinese)
Technical workflow
Training (diagram): source-language and target-language texts; filter wrong language; sentence-split; tokenization; sentence-align; score alignment; filter alignments; bitexts aligned at sentence level; Moses’ training; resulting models: phrase table, reordering model, language model.
Translation (diagram): translation client; translation server; Moses decoders.
Bitexts aligned at sentence level (En / Es):
• En: Strengthening of forum for human dignity : legal aid
  Es: Fortalecimiento del foro para la dignidad humana – asistencia jurídica
• En: must respect all aspects of human dignity
  Es: debe respetar todos los aspectos de la dignidad humana
• En: should fully respect human dignity
  Es: se deben respetar plenamente la dignidad humana
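The sentence-split and tokenization boxes of the workflow can be illustrated with a short sketch. This is a naive regex-based version written for illustration only; the function names and the rules are assumptions, and a production system would rely on language-specific tools.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split a sentence into words and single punctuation marks.

    Words may contain internal hyphens or apostrophes (e.g. 'Couvre-roues').
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


sentences = split_sentences(
    "It must respect all aspects of human dignity. "
    "It should fully respect human dignity."
)
tokens = tokenize("Strengthening of forum for human dignity : legal aid")
```

Here sentences holds two items, and tokens keeps the colon as a separate token, which is the form a sentence aligner and Moses would then work on.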
IPC context
• Gather data:
– Get existing definitions
– Add IPC schema (xml on WIPO website)
– Add “few” texts from patents
• “Learn” a translation model
• Translate new texts
Get existing data, build parallel texts
Existing definitions…
Wheels
roues
IPC schema…
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
<textBody> <title> <titlePart>
<text>Wheel guards</text>
</titlePart></title></textBody>
</ipcEntry>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
<textBody> <title><titlePart>
<text>Couvre-roues</text>
</titlePart></title></textBody>
</ipcEntry>
Patent texts…
WO/2013/014517
(EN) TYRE FOR VEHICLE WHEELS
(FR) PNEUMATIQUE POUR ROUES DE VÉHICULE
Bitext: training material…
• Wheels / roues
• Wheel guards / Couvre-roues
• Tyre for vehicle wheels / Pneumatique pour roues de véhicule
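As an illustration of how the IPC schema entries could be turned into bitexts, here is a minimal sketch that pairs English and French titles by IPC symbol. The wrapping <ipc> root element and the function names are assumptions made for the example; the real scheme files may be organised differently (for instance one file per language, or with XML namespaces).

```python
import xml.etree.ElementTree as ET

# Sample built from the two entries shown on the slide.
# The <ipc> wrapper element is hypothetical.
SAMPLE = """<ipc>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="EN">
<textBody><title><titlePart><text>Wheel guards</text></titlePart></title></textBody>
</ipcEntry>
<ipcEntry kind="1" symbol="B61F0019020000" ipcLevel="A" entryType="K" lang="FR">
<textBody><title><titlePart><text>Couvre-roues</text></titlePart></title></textBody>
</ipcEntry>
</ipc>"""


def titles_by_symbol(xml_text, lang):
    """Map each IPC symbol to its title text for one language."""
    root = ET.fromstring(xml_text)
    titles = {}
    for entry in root.iter("ipcEntry"):
        if entry.get("lang") != lang:
            continue
        text = entry.find("./textBody/title/titlePart/text")
        if text is not None and text.text:
            titles[entry.get("symbol")] = text.text.strip()
    return titles


def build_bitext(xml_text, src="EN", tgt="FR"):
    """Pair source and target titles that share the same IPC symbol."""
    src_titles = titles_by_symbol(xml_text, src)
    tgt_titles = titles_by_symbol(xml_text, tgt)
    return [
        (src_titles[symbol], tgt_titles[symbol])
        for symbol in sorted(src_titles)
        if symbol in tgt_titles
    ]


bitext = build_bitext(SAMPLE)
```

For the sample above, bitext contains the single pair ("Wheel guards", "Couvre-roues"), matching the training material listed on the slide.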
How well does it work?
Automatic evaluation: BLEU score
Principle: similarity of n-grams between the evaluated sentences and the reference sentences
On IPC definitions, English-French: BLEU = 48%
(without patent data: 44%)
Good quality, but still needs human post-editing
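The n-gram principle behind BLEU can be sketched as follows. This is a simplified, single-reference, sentence-level version without smoothing, written for illustration; reported BLEU scores are normally computed over a whole test corpus with an established tool. The candidate sentence in the usage lines is made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalise candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "pneumatique pour roues de véhicule"
perfect = bleu(reference, reference)
partial = bleu("pneu pour roues de véhicule", reference)  # made-up system output
```

An exact match scores 1.0, while the made-up output, which differs from the reference by one word, scores noticeably lower because that word breaks several of the longer n-grams.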
Tapta4IPC prototype (1)
Live demo using:
http://patentscope.wipo.int/translateUN/translateIPC.jsf
Tapta4IPC prototype (2)
http://fulty3.wipo.int:8080/Wtapta/translateIPC.jsf
Conclusion / future work
This is a prototype, but the quality already looks
acceptable
Human evaluation?
Better integrate the tool
In PCA6TRANSDEF ?
Other languages?
Tapta4IPC in various languages
Tapta4IPC should work reasonably well for the following languages (we
have built some language-specific tools and we have patent corpora):
• German
• Japanese
• Korean
• Spanish
• Dutch
• Portuguese
• Chinese
• Russian
More challenging:
• Czech, Slovak, Polish (many word forms, training corpus?)
• Estonian (even more word forms, would in theory require a larger
training corpus)
Other languages: Arabic, Italian, Danish, Swedish etc.
Thank you for your attention
‫شكرا لكم على اهتمامكم‬
Merci pour votre attention!
感谢您的关注
Grazie per la vostra attenzione!
¡ Gracias por su atención !
Vielen Dank für Ihre Aufmerksamkeit!
Obrigado pela vossa atenção!
Dziękuję bardzo za Państwa uwagę!
Děkujeme za Vaši pozornost!
Ďakujem ti veľmi pekne za tvoju pozornosť
Tänan tähelepanu eest!
Благодарим за Вашето внимание!
Tak for Jeres opmærksomhed!