A roadmap for MT: four "keys"
to handle more languages, for all kinds of tasks,
while making it possible to improve quality (on demand)
International Conference on Universal Knowledge and Language
(ICUKL2002), Goa, 25-29 November 2002
Christian Boitet
GETA, CLIPS, IMAG, 385 av. de la bibliothèque, BP 53
F-38041 Grenoble cedex 9, France
[email protected], http://clips.imag.fr/geta
Outline
• Basic concepts
What is MT?
Goals: Quality / User
Architectures: Vauquois' triangle
• State of the art
MT of texts: examples, problems
MT of spoken dialogs
• The future of MT
Goals
4 keys
Ch. Boitet
ICUKL2002, Goa, 25-29/11/2002
2/30
What is M(a)T?
• At least 3 types of automation
MT = Machine Translation
MAT = Machine Assisted Translation
MAHT = Machine Aided Human Translation
• A scientific technology
Informatics (computer science)
Linguistics
Mathematics
Goals: Quality / User
• Two axes
User: from linguistically naive to linguistically specialized
Quality: rough, quick; from raw to very good
• MT for access
general information
special fields: atom, chemistry…
• MT for translators
helps: lexicons, proposals from a translation memory…
• MT for individual authors
with interactive disambiguation
• MT for revisors (posteditors)
raw MT, polishable
Architectures: Vauquois' triangle
Architectures: Vauquois' triangle (larger)
Formal intermediate structures
• Linguistic level(s): surface, deep; 1-level, n-level
• Linguistic main organization: syntagms (constituents), dependencies, logical and semantic relations
• Geometrical structure: string, structured string, chain graph (chart), tree structure, graph / network, hypergraph
• Algebraic structure: labels, Boolean features, structured attributes, feature structures
• Correspondence between structure and text: concrete (text readable from structure) (almost all), abstract (Ariane-G5, Sygmart) (e.g. UNL)
• Scope: sentence, paragraph, page, document
How to produce an MT system
• Choose an architecture
• Program the "tools"
Specialized languages for linguistic programming (SLLP)
Development environment (MT shell)
• Build the "lingware"
Lexical data / rules / weights
Grammatical data / rules / weights
Possible specialization to a typology ("sublanguage")
• How?
Human work ± computer help / support
Automatic learning (weights, likelihoods…)
State of affairs
• only a small number of language pairs is covered by MT
systems designed for information access
Systran EC (2000): 19 of 110 language pairs, 8 of them OK for the intended use
See also examples by Ronaldo Martins
• even fewer are capable of quality translation or speech
translation
• Now a few examples…
Examples: MT for access, Web (1)
ENGLISH (human version)
The European-Heritage.net thesaurus covers the fields of archaeology and architecture as defined in the Council of Europe conventions signed in Granada (1985) and Malta (1992).
It encompasses information ranging from the partners involved, categories of cultural assets and legislation, to activities, skills and funding. It is supplemented by a number of specific thesauruses compiled by each member state on a particular topic, such as the thesaurus on Andalusian heritage or the architectural thesaurus from the Mérimée database in France.
This new, open-ended search tool will come on line shortly, together with a management and administration system shared among the various contributors.
FRENCH (human version)
Le thesaurus European-Heritage.net couvre les champs de l'archéologie et de l'architecture au sens des conventions du Conseil de l'Europe de Grenade (1985) et de Malte (1992).
Il prend en compte des aspects aussi variés que les acteurs, les catégories de biens culturels, la législation ou encore les interventions, les métiers et les financements. Il est complété et prolongé par des thesaurus spécifiques développés par chaque Etat membre sur tel ou tel sujet spécifique, comme le thesaurus du patrimoine historique andalou ou le thesaurus d'architecture de la base de données documentaire Mérimée en France.
Cet instrument de recherche, forcément évolutif, sera mis prochainement en ligne accompagné d'un dispositif de gestion et d'administration réparti entre les différents contributeurs.
ENGLISH (Systran FRE-ENG version)
The European-Heritage.net thesaurus covers the fields of archaeology and architecture within the meaning of conventions of the Council of Europe of Grenade (1985) and Malta (1992).
It takes into account aspects as varied as the actors, the categories of cultural goods, the legislation or the interventions, the trades and the financings. It is supplemented and prolonged by thesaurus specific developed by each Member State on such or such specific subject, like the thesaurus of the Andalusian historical inheritance or the thesaurus of architecture of the documentation data base Mérimée in France.
This instrument of search, inevitably evolutionary, will be put soon on line accompanied by a device of management and administration distributed between the various contributors.
Examples: MT for access, Web (2)
• French-English (FE) is quite "easy" compared with English-German (EG) and especially French-German (FG)
GERMAN (Systran ENG-GER version)
Der European-Heritage.netthesaurus umfaßt die Felder von archaeology und von Architektur, wie in den Europaratvereinbarungen definiert, die in Granada (1985) unterzeichnet werden und in Malta (1992).
Er gibt die Informationen um, die von den betroffenen Partnern, von den Kategorien der kulturellen Werte und der Gesetzgebung, bis zu Aktivitäten, von den Fähigkeiten und von der Finanzierung reichen. Er wird durch eine Anzahl von den spezifischen Thesauren ergänzt, die durch jeden Mitgliedsstaat auf einem bestimmten Thema, wie dem Thesaurus auf Andalusian Erbe oder dem architektonischen Thesaurus von der Datenbank Mérimée in Frankreich kompiliert werden.
Dieses neue, offene Suchhilfsmittel kommt auf Zeile kurz, zusammen mit einem Management- und Leitungssystem, das unter den verschiedenen Mitwirkenden geteilt wird.
GERMAN (Systran FRE-GER version)
Der European-Heritage.net-Thesaurus bedeckt die Felder der Archäologie und der Architektur im Sinne der Übereinkommen des Europarats von Granada (1985) und von Malta (1992).
Er berücksichtigt Aspekte dermaßen variierte, daß die Beteiligten, die Kategorien kultureller Güter, die Gesetzgebung oder noch die Interventionen, die Berufe und die Finanzierungen. Er wird vervollständigt und wird durch ein spezifische Thesaurus entwickelt durch jeder Mitgliedstaat über das eines oder andere spezifische Thema verlängert, als der Thesaurus des andalusischen historischen Kulturgutes oder der Thesaurus der Architektur der urkundlichen Datenbank Mérimée in Frankreich.
Dieses notgedrungen entwicklungsfähige Forschungsinstrument wird gestellt demnächst online begleitet von einer Verwaltungs- und Verwaltungsvorrichtung, die aufgeteilt unter den verschiedenen Beitragenden.
Comparison: raw vs rough MT
SpanAm raw Spanish-English output
Message of the Director-General of the World Health Organization
From its discovery, antibiotics have completely transformed the perspective of humankind with respect to infectious diseases. Today the use of antibiotics, combined with improvements in sanitation, housing, and nutrition, together with the advent of the vaccination programs generalized, have caused a notable reduction of infectious diseases that previously were common and annihilated entire populations.
Scourges that terrified millions of people, as plague, whooping cough, poliomyelitis, and the scarlatina, have been controlled or are on the verge of being controlled. Now, in the dawn of a new millennium, humankind faces another crisis. Previously curable diseases as the gonorrhea and typhoid fever are becoming rapidly difficult to treat, while old assassins as tuberculosis and malaria now are armed of the increasingly impenetrable resistance to the antimicrobial drugs.
This phenomenon is potentially contenible. The problem is increasingly profound and complex, accelerated by the abuse of antibiotics in the developed countries and the paradoxical underutilization of the quality antimicrobial drugs in the developing countries due to the poverty and to the scarcity resulting from an effective health care.
Reverso raw Spanish-English output
Message of the Chief operating officer of the World Organization of the Health
From his{*its*} discovery, the antibiotics have transformed completely the perspective of the humanity with regard to the infectious diseases. Today the use of the antibiotics, cocktail with improvements in the reparation, the housing and the nutrition, together with the advent of the programs of widespread vaccination, they have given place to a notable decrease of infectious diseases that before were common and were annihilating entire populations.
Scourges that terrified million persons, as the pest, the savage cough, the poliomyelitis and the scarlatina, they have been controlled or are on the verge of be controlling. Now, in the dawn of a new millenium, the humanity faces with another crisis. Diseases before curable as the gonorrhea and the fever tifoidea they are becoming rapidly difficult to treat, whereas killer old men as the tuberculosis and the malaria are armed{*assembled*} now with the increasing impenetrable resistance the antimicrobial ones.
This phenomenon is potentially contenible. The problem is increasingly deep and complex, accelerated by the abuse of the antibiotics in the developed countries and the paradoxical subutilization of the antimicrobial ones of quality in the countries in development due to the poverty and the resultant shortage of an attention of effective health.
Examples: MT for revisors…
…with BV-aero/FE (2)
MT of spoken dialogs
• Specialized systems are already usable
e.g. ATR/Matsushita, IBM, CSTAR/Nespole!…
Much "noise" and "ungrammaticalities"
But specializing is very helpful!
• General systems are also possible
e.g. NEC/Xroad, Linguatec/Talk&Translate
Speech recognition is already good enough
Rough may be good enough (e.g. for chatting)
• Interpretation is different from translation…
…and participants are intelligent!
Similarity with access-oriented MT
French-Korean through IF (1)
French-Korean through IF (2)
French-Korean through IF (3)
A road map… to which goals?
• MT of adequate quality
• Not only for access
• For all languages
Four keys
• 2 on the technical side
Compromise: a far wider coverage, a somewhat smaller asymptotic quality
• Automatic learning techniques
• Using non-textual pivots (intermediate formal descriptors)
• 2 on the organizational side
Democratization, cooperation
• Cooperative development of open source linguistic resources on the Web
• Towards systems where quality can be improved "on demand" by users
Learning techniques
• Extend the use of hybrid techniques
symbolic, numerical, or mixed
==> they have demonstrated their potential at the research level
• stochastic grammars
• weighted (or "neural") dictionaries
• or build new tools, intrinsically numerical
inspiration from voice recognition
• 2 examples
learning analyzers: text -> semantic tree (IBM)
learning an implicit, very detailed DG from a treebank (NAIST)
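As a minimal sketch of what a weighted dictionary could look like, assume the weights are plain relative frequencies estimated from word-aligned examples. The talk does not say how weights are learned, so this estimator is an assumption, and the aligned pairs and function names below are hypothetical.

```python
from collections import Counter, defaultdict

def learn_weights(aligned_pairs):
    """Estimate P(target | source) by relative frequency (an assumed, very simple learner)."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    weights = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        weights[source] = {t: c / total for t, c in targets.items()}
    return weights

def best_translation(weights, source):
    """Return the highest-weighted target word, or None for an unknown source word."""
    candidates = weights.get(source)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Hypothetical word-aligned French-English examples (illustration only).
pairs = [("carte", "map"), ("carte", "card"), ("carte", "map"), ("carte", "map")]
weights = learn_weights(pairs)
print(weights["carte"], best_translation(weights, "carte"))
```

A real system would combine such weights with context rather than always taking the single heaviest entry.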
Using non-textual pivots
• Semantico-pragmatic (ontological) pivots
task & domain oriented ==> limited applicability
• Abstract linguistic descriptors
the most precise, but often too sophisticated
depend on each language
• Anglo-semantic pivot: UNL
"the HTML of linguistic content"
• in UNL, a hypergraph represents the abstract structure of a
(supposedly) equivalent English utterance
less precise but "robust"
symbols constructed from English ==> usable by all developers
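The practical appeal of a pivot can be made concrete with a little arithmetic. The sketch below rests on the usual simplifying assumption that a direct approach needs one system per ordered language pair, while a pivot approach needs one analyzer into the pivot and one generator out of it per language. The function names are illustrative, and the language counts are the ones quoted elsewhere in this talk.

```python
def transfer_modules(n):
    """Direct approach: one transfer system per ordered language pair."""
    return n * (n - 1)

def pivot_modules(n):
    """Pivot approach: one analyzer and one generator per language."""
    return 2 * n

for n in (11, 20, 180):
    print(f"{n} languages: {transfer_modules(n)} pairs, {pivot_modules(n)} pivot modules")
```

With 11 languages the direct approach already needs 110 systems against 22 pivot modules, and the gap grows quadratically with the number of languages.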
A simple UNL graph
Nodes (universal words):
score(icl>event,agt>human,fld>sport).@entry.@past.@complete
Ronaldo(icl>proper noun)
head(pof>body).@def
corner(icl>thing).@def
goal(icl>abstract thing)
goal(icl>concrete thing)
left(aoj<thing)
Arc labels (relations): agt, obj, ins, plt, pos (twice), mod
• Ronaldo has headed the ball into the left corner of the goal
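The node labels in this graph follow a visible pattern: a headword, an optional parenthesized list of restrictions such as icl>event, and optional attributes introduced by ".@". The sketch below splits a label into those three parts. It is a simplified reading of the labels on this slide only, not an implementation of the official UNL specification, and the names UniversalWord and parse_uw are hypothetical.

```python
import re
from typing import NamedTuple

class UniversalWord(NamedTuple):
    headword: str
    restrictions: list  # (label, direction, value) triples, e.g. ("icl", ">", "event")
    attributes: list    # attribute names without the ".@" prefix

# Simplified pattern inferred from the slide's labels (an assumption, not the UNL grammar).
_UW = re.compile(r"(?P<head>[^(.]+)(?:\((?P<restr>[^)]*)\))?(?P<attrs>(?:\.@\w+)*)")
_RESTRICTION = re.compile(r"(\w+)([<>])(.+)")

def parse_uw(text):
    """Split a node label into headword, restrictions and attributes."""
    match = _UW.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable node label: {text!r}")
    restrictions = []
    if match.group("restr"):
        for item in match.group("restr").split(","):
            parsed = _RESTRICTION.fullmatch(item.strip())
            if parsed is None:
                raise ValueError(f"not a recognizable restriction: {item!r}")
            restrictions.append(parsed.groups())
    attributes = re.findall(r"\.@(\w+)", match.group("attrs"))
    return UniversalWord(match.group("head").strip(), restrictions, attributes)

entry = parse_uw("score(icl>event,agt>human,fld>sport).@entry.@past.@complete")
print(entry.headword, entry.restrictions, entry.attributes)
```

Keeping restrictions as triples preserves the direction of the symbol, which differs between icl>event and aoj<thing.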
Cooperative development
• of open source linguistic resources
• on the Web
Sharing (mutualization) is necessary at least for lexical knowledge
too costly even for the leaders
size (#entries) has to grow for each language (300K, 3M?)
#languages has to increase dramatically (11 -> 20 -> 180?)
Integration of human- and machine-oriented knowledge is useful
e.g. to produce mixed MT/MAHT systems
A contribution: the Papillon project
• Goal:
produce many open source dictionaries from a central lexical database
• Means:
build rich (DiCo) monolingual dictionaries of lexies (senses)
interlink lexies by interlingual links (axies)
use XML & associated tools as basis to generate many formats
• for humans and for machines
start from (free) digital resources
induce "consumers" to become "producers" (contributors)
• Quality control:
private accounts
central validating/integrating group
Papillon database macrostructure
[Diagram, top to bottom]
Users: interaction with the dictionaries
Dictionaries: extraction of dictionaries from the lexical database
Lexical database and human contributors
Resources: integration of existing resources
PAPILLON diagram
French DiCo
Vocable carte n.f.
Lexie carte.1: carte à jouer
Lexie carte.2: carte géographique
Engl. DiCo
Vocable card N
Lexie card.1: playing card
Lexie card.2: money card
Vocable=lexie map
Japan. DiCo: カード, 地図
Thai DiCo
Interlingual links
Acception 343, UNL: card(icl>play), card(icl>thing)…
Acception 345, UNL: map(fld>geography)
Acception 1002, UNL: card(fld>money)
Interlingual links based on translations = "AXIEs"
Possibility to link 1 lexie with >1 acceptions
References to other semantic systems: AXIE (1) -> (n) UW
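The diagram suggests how a bilingual dictionary could be extracted from the central database: two lexies translate each other when they attach to the same interlingual acception. The sketch below illustrates that idea. The lexies and acception numbers come from the diagram, but which lexie attaches to which acception is inferred here from the glosses and is an assumption, as are the language codes and the function name extract_bilingual.

```python
from collections import defaultdict

# Lexies per language, with the glosses shown in the diagram.
lexies = {
    "fra": {"carte.1": "carte à jouer", "carte.2": "carte géographique"},
    "eng": {"card.1": "playing card", "card.2": "money card", "map": "map"},
}

# ASSUMPTION: lexie-to-acception attachments are inferred from the glosses.
axies = {
    343: {("fra", "carte.1"), ("eng", "card.1")},
    345: {("fra", "carte.2"), ("eng", "map")},
    1002: {("eng", "card.2")},
}

def extract_bilingual(axies, source, target):
    """Pair source and target lexies that share an acception."""
    result = defaultdict(set)
    for members in axies.values():
        sources = [lexie for lang, lexie in members if lang == source]
        targets = [lexie for lang, lexie in members if lang == target]
        for lexie in sources:
            result[lexie].update(targets)
    # Drop source lexies that have no counterpart in the target language.
    return {lexie: sorted(found) for lexie, found in result.items() if found}

fra_eng = extract_bilingual(axies, "fra", "eng")
eng_fra = extract_bilingual(axies, "eng", "fra")
for lexie, translations in fra_eng.items():
    print(lexies["fra"][lexie], "->", [lexies["eng"][t] for t in translations])
```

Because every language links only to the acceptions, adding a new language requires no direct links to the languages already present.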
Construct systems where quality can
be improved "on demand" by users
• a priori through interactive disambiguation in the source
language
• or a posteriori by correcting the pivot representation (UNL
or other) through any language (as in MultiMeteo)
==> In both cases, all versions (in all languages) are improved
• possibility to merge
MT
multilingual generation
computer-aided authoring
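Why a single correction improves every language version can be shown with a toy model in which each target text is regenerated from the shared pivot. This is only a sketch: the pivot is reduced to a list of symbols, the mini-lexicons are hypothetical, and real generation from UNL or another pivot is far richer.

```python
# Hypothetical mini-lexicons mapping pivot symbols to words (illustration only).
LEXICONS = {
    "eng": {"card(icl>play)": "playing card", "map(fld>geography)": "map"},
    "fra": {"card(icl>play)": "carte à jouer", "map(fld>geography)": "carte géographique"},
}

def generate(pivot, language):
    """Render a pivot (here just a list of symbols) in one target language."""
    return [LEXICONS[language][symbol] for symbol in pivot]

def generate_all(pivot):
    """Regenerate every language version from the same pivot."""
    return {language: generate(pivot, language) for language in LEXICONS}

# Suppose analysis of an ambiguous source word picked the wrong sense...
pivot = ["card(icl>play)"]
before = generate_all(pivot)

# ...and a user corrects the pivot once, working through any one language.
pivot[0] = "map(fld>geography)"
after = generate_all(pivot)
print(before, after)
```

The correction is made once on the shared representation, so no target language has to be post-edited separately.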
Conclusion
• 4 keys to open the door to MT of adequate quality for all
languages
• On the technical side,
dramatically increase the use of learning techniques
use pivot architectures, the most universally usable pivot being UNL
• On the organizational side,
cooperatively develop open source linguistic resources on the web
construct systems where quality can be improved "on demand" by users
• On the practical side,
seek keys to unlock private investment, public funding, voluntary cooperation
could this conference become a decisive turning point?