Semantics in Statistical
Machine Translation
Jan Odijk
MA-Rotation Lecture
Utrecht, March 10, 2011
1
Overview
• Machine Translation (MT)
• Rule-based MT
• Statistical MT
• Hybrid MT
2
MT: What is it?
• Input: text in source language
• Output: text in target language that is a
translation of the input text
3
MT: What is it?
Interlingua
Analyzed input → transfer → Analyzed output
Input → direct translation → Output
4
MT: System Types
• Direct:
– Earliest systems (1950s)
• Direct word-to-word translation
– Recent statistical MT systems
• Transfer
– Almost all research and commercial systems up to
1990
• Interlingual
5
MT: System Types
• Interlingual
– A few research systems in the 1980s
• Rosetta (Philips), based on Montague Grammar
– Semantic derivation trees of attuned grammars
• Distributed Translation (BSO)
– (enriched) Esperanto
• Sometimes logical representations
• Hybrid Interlingual/Transfer
– Transfer for lexicons; IL for rules
6
Rule-Based Systems
• Most systems
– explicit source language grammar
– parser yields analysis of source language input
– transfer component turns it into target language
structure
– no explicit grammar of target language (except
morphology)
7
Rule-Based Systems
• Some systems (Eurotra)
– explicit source and target language grammar
• sometimes reversible
– parser yields analysis of source language input
– transfer component turns it into target language
structure
– generation of translation by target language
grammar
8
Rule-Based Systems
• Some systems (Rosetta, DLT)
– explicit source and target language grammar
• in some cases reversible
– parser yields interlingual representation
– generation of translation by target language
grammar from interlingual representation
9
MT: Is it difficult?
• FAHQT: Fully Automatic High Quality
Translation
– Fully Automatic: no human intervention
– High Quality: close or equal to human
translation
• Even acceptable quality is difficult to
achieve
10
MT: Problems
• Ambiguity
– Real
• Cannot be resolved by grammar
• Is much higher than a human can imagine!
• Require world knowledge modeling or statistics
– Temporary
• Are resolved by the grammar but require large
computational resources
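A rough sense of how quickly structural ambiguity can grow: if every binary bracketing of an n-word phrase counted as a possible analysis, the number of analyses would be the Catalan number C(n-1). The sketch below computes that count. The function name is illustrative, and a real grammar licenses only a subset of these bracketings, so this is an upper-bound illustration rather than a measurement of any system.

```python
from math import comb

def binary_bracketings(n_words: int) -> int:
    """Number of distinct binary trees over n_words leaves.

    This equals the Catalan number C(n_words - 1).
    """
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)

if __name__ == "__main__":
    # Print the count for a few phrase lengths to show the growth.
    for n in (2, 5, 10, 20):
        print(n, binary_bracketings(n))
```

Already at 10 words there are thousands of bracketings, which is why statistics or world knowledge are needed to choose among them.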
11
MT: Problems
• Computational Complexity
– Most rule-based systems have a context-free
base (O(n³)) plus extensions (O(?))
– Require large computational resources
– Require large memory resources
– Sentences with length > 20 hardly processable
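The cubic cost of the context-free base can be seen in a CKY-style recognizer, which has three nested loops over span width, span start, and split point. The following is a minimal sketch, assuming a grammar in Chomsky normal form. The toy grammar and lexicon are hypothetical and not taken from any of the systems mentioned.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (for illustration only).
BINARY_RULES = {          # (B, C) -> set of A such that A -> B C
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {               # word -> set of lexical categories
    "he": {"NP"},
    "wears": {"V"},
    "a": {"Det"},
    "suit": {"N"},
}

def cky_recognize(words, start="S"):
    """Return True if the toy grammar derives the word sequence."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word, ()))
    for width in range(2, n + 1):            # span length
        for i in range(n - width + 1):       # span start
            j = i + width
            for k in range(i + 1, j):        # split point
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]
```

The three loops over `width`, `i`, and `k` give the O(n³) behaviour in sentence length. The extensions that practical grammars add on top of the context-free base are not modelled here.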
12
MT: Problems
• Complexity of language
– Many different construction types
– All interacting with each other
– Full coverage is hard to achieve → systems often fall
back on robustness measures
– For many constructions a proper analysis is not
known
– Theoretical linguistics is not going to help
because of focus on explanatory adequacy
13
MT: Problems
• Divergences between languages
– Lexical categorial:
• zich ergeren vs. (be) annoyed (Verb-Adj)
• hij zwemt graag vs. he likes to swim
– Phrasal categorial
• I expect her to leave
– ik verwacht dat zij vertrekt
• She is likely to come
– het is waarschijnlijk dat zij komt
14
Conflational Divergences:
• prepositional complements
– houden van vs. love
• existential er vs. Ø
– er passeerde een auto vs. a car passed
• verbal particles
– blow (something) up vs. volar
15
Conflational Divergences:
• reflexive verbs
– zich scheren vs. shave
• composed vs. simple tense forms
– he will do it vs. lo hará
• split negatives vs. composed negatives
– he does not see anyone vs. hij ziet niemand
16
Functional Divergences:
• I like these apples
– me gustan estas manzanas
• se venden manzanas aquí
– hier verkoopt men appels
• er werd door de toeschouwers gejuicht
– the spectators were cheering
17
Divergences: MWEs
• semi-fixed MWEs
– nuclear power plant vs. kerncentrale
• flexible idioms
– de plaat poetsen vs. bolt
– de pijp uit gaan vs. to kick the bucket
18
Divergences: MWEs
• semi-idioms (collocations)
– zware shag vs. strong tobacco
• semi-idioms (support verbs)
– aandacht besteden aan
– pay attention to
19
MT: Why is it so difficult?
• Language Competence vs. Language Use
– Earlier research systems implemented idealized
reality
– But not actually occurring language use
– In some cases
• focus on theoretically interesting difficult
constructions (that do occur in reality)
• But other constructions are more important to deal
with in practical systems
20
MT: Why is it so difficult?
• Large and rich lexicons
– Existing human-oriented dictionaries are not
suited as such
– All information must be available in a
formalized way
– Much more information is needed than in a
traditional dictionary
21
MT: Why is it so difficult?
• Multi-word Expressions (MWEs)
– Are covered in current dictionaries only in a very
informal way
– No standards on how to represent them
lexically
– Many different types requiring different
treatment in the grammar
– Huge numbers!!
– Domain- and company-specific terms are
often MWEs
22
MT: Why is it so difficult?
• All systems must make approximations:
– Ignore certain ambiguities to begin with
– Use only limited amount of relevant
information
– Cut off analysis when there are too many
alternatives
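One common way to cut off analysis is beam pruning: keep only the best-scoring few alternatives at each step and discard the rest. The sketch below assumes each alternative is a (score, analysis) pair with higher scores being better. The function name and data layout are illustrative and not taken from any particular system.

```python
import heapq

def prune(hypotheses, beam_size):
    """Keep only the beam_size highest-scoring (score, analysis) pairs.

    The result is sorted from best to worst score.
    """
    return heapq.nlargest(beam_size, hypotheses, key=lambda h: h[0])
```

The price of this approximation is that the correct analysis may be among the discarded alternatives.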
23
Statistical MT
• Statistical MT
• Derives MT-system automatically
– From statistics taken from
• Aligned parallel corpora (→ translation model)
• Monolingual target language corpora (→ language
model)
• Being worked on since the early 1990s
• Paradigm originates in speech recognition
(which in turn builds on noisy channel models)
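In the noisy channel formulation, the translation of a source sentence f is the target sentence e that maximizes P(e) · P(f | e), where P(e) comes from the language model and P(f | e) from the translation model. The sketch below shows that decision rule on a toy example. All probabilities and candidate sentences are invented for illustration, and a real system estimates them from corpora and searches a far larger candidate space.

```python
import math

# Hypothetical probabilities, invented purely for illustration.
# Translation model: P(source sentence | target sentence)
TRANSLATION_MODEL = {
    ("hij draagt een pak", "he wears a suit"): 0.4,
    ("hij draagt een pak", "he carries a suit"): 0.4,
    ("hij draagt een pak", "he carries a package"): 0.2,
}
# Language model: P(target sentence)
LANGUAGE_MODEL = {
    "he wears a suit": 0.05,
    "he carries a suit": 0.001,
    "he carries a package": 0.03,
}

def noisy_channel_score(source, target):
    """log P(target) + log P(source | target); -inf if either is unknown."""
    p_tm = TRANSLATION_MODEL.get((source, target), 0.0)
    p_lm = LANGUAGE_MODEL.get(target, 0.0)
    if p_tm == 0.0 or p_lm == 0.0:
        return float("-inf")
    return math.log(p_lm) + math.log(p_tm)

def best_translation(source):
    """Return the highest-scoring candidate, or None if there is none."""
    candidates = [t for (s, t) in TRANSLATION_MODEL if s == source]
    if not candidates:
        return None
    return max(candidates, key=lambda t: noisy_channel_score(source, t))
```

In this toy setup the translation model alone cannot decide between "wears" and "carries". The language model tips the balance towards the more fluent target sentence.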
24
MT: Can we make it possible?
• Plus:
– No or very limited grammar development
– Includes language and world knowledge automatically
(but implicitly)
– Based on actually occurring data
– Currently many experimental and commercial systems
• Minus:
– Requires large aligned parallel corpora
– Clearly has problems with longer span dependencies
25
Statistical MT
• Google Translate (statistical MT)
• Hij draagt een pak. → √ He wears a suit.
• Hij draagt schoenen. → √ He wears shoes.
• Hij draagt bruine schoenen en een pak. → √ He wears a suit and brown shoes. (!!)
• Hij draagt het pakket. → √ He carries the package.
• Hij heeft een pak aan. → * He has a suit.
• Voert uw bedrijf sloten uit? → * Does your company locks out?
26
Hybrid MT:
• Can we somehow combine the strengths of rule-based approaches and statistical approaches
– And avoid their disadvantages?
• Active research area
– Several projects
27
Hybrid MT
• Euromatrix esp. “the Euromatrix”
– Lists data and tools for European language pairs
– Goals
• Translation systems for all pairs of EU languages
• Organization, analysis and interpretation of a competitive annual international
evaluation of machine translation
• The provision of open source machine translation technology including
research tools, software and data
• A systematically compiled and constantly updated detailed survey of the state
of MT technology for all EU language pairs
• Efficient inclusion of linguistic knowledge into statistical machine translation
• The development and testing of hybrid architectures for the integration of
rule-based and statistical approaches
• Successor project EuromatrixPlus
28
Hybrid MT
• PACO-MT 2008-2011
• Investigates hybrid approach to MT
– Rule-based and statistical
– Uses existing parser for source language
analysis
– Uses statistical n-gram language models for
generation
– Uses statistical approach to transfer
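To illustrate how an n-gram language model can support generation, the sketch below trains an add-one smoothed bigram model on a tiny corpus and uses it to choose the most fluent ordering of a bag of words. The corpus, function names, and the brute-force search over permutations are assumptions made for this example, not a description of how PACO-MT works.

```python
from collections import Counter
from itertools import permutations
import math

# Tiny hypothetical training corpus (for illustration only).
CORPUS = [
    "he wears a suit",
    "he wears brown shoes",
    "she wears a suit",
]

def train_bigrams(sentences):
    """Count bigram histories and bigrams, and return the vocabulary size."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # counts of bigram histories
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {tok for s in sentences for tok in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def best_order(words, model):
    """Pick the word order that the language model scores highest."""
    return max((" ".join(p) for p in permutations(words)),
               key=lambda s: log_prob(s, *model))
```

Usage: `best_order(["suit", "a", "wears", "he"], train_bigrams(CORPUS))` selects the ordering seen most consistently in the training data. Trying all permutations is only feasible for very short inputs.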
29
Hybrid MT
• META-NET 2010-2013 (EU-funding)
– Building a community with shared vision and strategic
research agenda
– Building META-SHARE, an open resource exchange
facility
– Building bridges to neighbouring technology fields
• Bringing more Semantics into Translation
• Optimising the Division of Labour in Hybrid MT
• Exploiting the Context for Translation
• Empirical Base for Machine Translation
30
Hybrid MT
• Bringing more Semantics into Translation
– Charles University Prague (Jan Hajic)
– FBK-Irst, Trento (Marcello Federico)
– UiL-OTS, Utrecht (Christer Samuelsson)
• currently orienting ourselves and trying to
determine a concrete topic for investigation
31
Hybrid MT: Semantics
• Possible Topics:
– Lexical semantics and its resources / Word
Sense Disambiguation
– Knowledge representations
– Multiword expressions
– Syntactic and semantic dependencies /
Semantic Role Labeling
– Discourse structure
– Co-reference resolution
– Recognizing Textual Entailment and MT
Evaluation
32
Semantics resources
• Lexical Semantics
– Resources: WordNet, EuroWordNet, BalkaNet,
WordNets for several languages
– Knowledge Repositories:
• OpenCyc, Wikipedia, DBpedia
• MWE Lexica: SAID, DUELME
33
Semantics Resources
• CoNLL 2009 Shared Task on syntactic and
semantic dependencies
– training and development data
– evaluation data
• Penn Discourse TreeBank
34
Hybrid MT
• Tools:
• SRL and Semantic Parsing: SWIRL, ASSERT, SENNA,
C&C (all for Eng), tools developed at Lund
University (for Eng and Chn)
35
Semantics Resources
• Tools:
• Co-Reference and Anaphora Resolution:
– BART (Eng),
– COREA (Dut)
• NER:
– BIOS (Eng)
36