Semantics in Statistical Machine Translation
Jan Odijk
MA-Rotation Lecture
Utrecht, March 10, 2011

Overview
• Machine Translation (MT)
• Rule-based MT
• Statistical MT
• Hybrid MT

MT: What is it?
• Input: text in source language
• Output: text in target language that is a translation of the input text

MT: What is it?
[Figure: translation pyramid with three levels: direct translation from Input to Output; transfer from Analyzed input to Analyzed output; Interlingua at the top]

MT: System Types
• Direct
  – Earliest systems (1950s)
    • Direct word-to-word translation
  – Recent statistical MT systems
• Transfer
  – Almost all research and commercial systems up to 1990
• Interlingual

MT: System Types
• Interlingual
  – A few research systems in the 1980s
    • Rosetta (Philips), based on Montague Grammar
      – Semantic derivation trees of attuned grammars
    • Distributed Language Translation (DLT; BSO)
      – (enriched) Esperanto
    • Sometimes logical representations
• Hybrid Interlingual/Transfer
  – Transfer for lexicons; IL for rules

Rule-Based Systems
• Most systems
  – explicit source language grammar
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – no explicit grammar of target language (except morphology)

Rule-Based Systems
• Some systems (Eurotra)
  – explicit source and target language grammar
    • sometimes reversible
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – generation of translation by target language grammar

Rule-Based Systems
• Some systems (Rosetta, DLT)
  – explicit source and target language grammar
    • in some cases reversible
  – parser yields interlingual representation
  – generation of translation by target language grammar from interlingual representation

MT: Is it difficult?
• FAHQT: Fully Automatic High Quality Translation
  – Fully Automatic: no human intervention
  – High Quality: close or equal to human translation
• Even acceptable quality is difficult to achieve

MT: Problems
• Ambiguity
  – Real
    • Cannot be resolved by grammar
    • Is much higher than a human can imagine!
    • Requires world knowledge modeling or statistics
  – Temporary
    • Is resolved by the grammar but requires large computational resources

MT: Problems
• Computational Complexity
  – Most rule-based systems have a context-free base (O(n^3)) plus extensions (O(?))
  – Require large computational resources
  – Require large memory resources
  – Sentences longer than 20 words are hardly processable

MT: Problems
• Complexity of language
  – Many different construction types
  – All interacting with each other
  – Full coverage is hard to achieve; systems often fall back on robustness measures
  – For many constructions the proper analysis is not known
  – Theoretical linguistics is not going to help because of its focus on explanatory adequacy

MT: Problems
• Divergences between languages
  – Lexical categorial
    • zich ergeren vs. (be) annoyed (Verb-Adj)
    • hij zwemt graag vs. he likes to swim
  – Phrasal categorial
    • I expect her to leave vs. ik verwacht dat zij vertrekt
    • She is likely to come vs. het is waarschijnlijk dat zij komt

Conflational Divergences:
• prepositional complements
  – houden van vs. love
• existential er vs. Ø
  – er passeerde een auto vs. a car passed
• verbal particles
  – blow (something) up vs. volar

Conflational Divergences:
• reflexive verbs
  – zich scheren vs. shave
• composed vs. simple tense forms
  – he will do it vs. lo hará
• split negatives vs. composed negatives
  – he does not see anyone vs. hij ziet niemand

Functional Divergences:
• I like these apples vs. me gustan estas manzanas
• se venden manzanas aquí vs. hier verkoopt men appels
• er werd door de toeschouwers gejuicht vs. the spectators were cheering

Divergences: MWEs
• semi-fixed MWEs
  – nuclear power plant vs. kerncentrale
• flexible idioms
  – de plaat poetsen vs. bolt
  – de pijp uit gaan vs. to kick the bucket

Divergences: MWEs
• semi-idioms (collocations)
  – zware shag vs. strong tobacco
• semi-idioms (support verbs)
  – aandacht besteden aan vs. pay attention to

MT: Why is it so difficult?
• Language Competence vs. Language Use
  – Earlier research systems implemented an idealized reality
  – But not the language use that really occurs
  – In some cases
    • focus on theoretically interesting difficult constructions (that do occur in reality)
    • But other constructions are more important to deal with in practical systems

MT: Why is it so difficult?
• Large and rich lexicons
  – Existing human-oriented dictionaries are not suited as such
  – All information must be available in a formalized way
  – Much more information is needed than in a traditional dictionary

MT: Why is it so difficult?
• Multi-word Expressions (MWEs)
  – Appear in current dictionaries only in a very informal way
  – No standards on how to represent them lexically
  – Many different types requiring different treatment in the grammar
  – Huge numbers!!
  – Domain- and company-specific terminology often consists of MWEs

MT: Why is it so difficult?
• All systems must make approximations:
  – Ignore certain ambiguities to begin with
  – Use only a limited amount of relevant information
  – Cut off analysis when there are too many alternatives

Statistical MT
• Statistical MT
  – Derives MT system automatically
    • From statistics taken from
      – Aligned parallel corpora (translation model)
      – Monolingual target language corpora (language model)
• Being worked on since the early 1990s
• Paradigm originates in speech recognition (which in turn builds on noisy channel models)

MT: Can we make it possible?
• Plus:
  – No or very limited grammar development
  – Includes language and world knowledge automatically (but implicitly)
  – Based on actually occurring data
  – Currently many experimental and commercial systems
• Minus:
  – Requires large aligned parallel corpora
  – Clearly has problems with longer span dependencies

Statistical MT
• Google Translate (statistical MT)
• Hij draagt een pak. √ He wears a suit.
• Hij draagt schoenen. √ He wears shoes.
• Hij draagt bruine schoenen en een pak. √ He wears a suit and brown shoes. (!!)
• Hij draagt het pakket. √ He carries the package.
• Hij heeft een pak aan. * He has a suit.
• Voert uw bedrijf sloten uit? * Does your company locks out?

Hybrid MT:
• Can we somehow combine the strengths of rule-based approaches and statistical approaches
  – And avoid their disadvantages?
• Active research area
  – Several projects

Hybrid MT
• Euromatrix, esp. “the Euromatrix”
  – Lists data and tools for European language pairs
  – Goals
    • Translation systems for all pairs of EU languages
    • Organization, analysis and interpretation of a competitive annual international evaluation of machine translation
    • The provision of open source machine translation technology including research tools, software and data
    • A systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs
    • Efficient inclusion of linguistic knowledge into statistical machine translation
    • The development and testing of hybrid architectures for the integration of rule-based and statistical approaches
• Successor project EuromatrixPlus

Hybrid MT
• PACO-MT 2008-2011
• Investigates hybrid approach to MT
  – Rule-based and statistical
  – Uses existing parser for source language analysis
  – Uses statistical n-gram language models for generation
  – Uses statistical approach to transfer

Hybrid MT
• META-NET 2010-2013 (EU funding)
  – Building a community with shared vision and strategic research agenda
  – Building META-SHARE, an open resource exchange facility
  – Building bridges to neighbouring technology fields
    • Bringing more Semantics into Translation
    • Optimising the Division of Labour in Hybrid MT
    • Exploiting the Context for Translation
    • Empirical Base for Machine Translation

Hybrid MT
• Bringing more Semantics into Translation
  – Charles University Prague (Jan Hajic)
  – FBK-Irst, Trento (Marcello Federico)
  – UiL-OTS, Utrecht (Christer Samuelsson)
    • currently orienting ourselves and trying to determine a concrete topic for investigation

Hybrid MT: Semantics
• Possible Topics:
  – Lexical semantics and their resources / Word Sense Disambiguation
  – Knowledge representations
  – Multiword expressions
  – Syntactic and semantic dependencies / Semantic Role Labeling
  – Discourse structure
  – Co-reference resolution
  – Recognizing Textual Entailment and MT Evaluation

Semantics Resources
• Lexical Semantics
  – Resources: WordNet, EuroWordNet, BalkaNet, WordNets for several languages
  – Knowledge Repositories:
    • OpenCyc, Wikipedia, DBpedia
    • MWE Lexica: SAID, DUELME

Semantics Resources
• CoNLL 2009 Shared Task on syntactic and semantic dependencies
  – training and development data
  – evaluation data
• Penn Discourse TreeBank

Hybrid MT
• Tools:
• SRL and Semantic Parsing: SWIRL, ASSERT, SENNA, C&C (all for Eng), tools developed at Lund University (for Eng and Chn)

Semantics Resources
• Tools:
• Co-Reference and Anaphora Resolution:
  – BART (Eng)
  – COREA (Dut)
• NER:
  – BIOS (Eng)
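
Worked example: translation model plus language model

The Statistical MT slides describe systems that combine a translation model (from aligned parallel corpora) with a language model (from monolingual target-language corpora) in a noisy channel set-up. A common way to write that idea is: pick the target sentence e that maximizes P(f | e) · P(e) for source sentence f. The sketch below is only a toy illustration of that scoring, not how Google Translate or any real system is implemented. It assumes word-by-word translation with no reordering, and every probability in `TRANSLATION_MODEL` and `TRIGRAM_LM` is invented for the example; the names `translate`, `lm_logprob` and `FLOOR` are likewise hypothetical.

```python
import itertools
import math

# Hypothetical translation model: source word -> {target word: P(source | target)}.
# All numbers are invented for illustration.
TRANSLATION_MODEL = {
    "hij": {"he": 1.0},
    "draagt": {"wears": 0.5, "carries": 0.5},
    "een": {"a": 1.0},
    "het": {"the": 1.0},
    "pak": {"suit": 0.6, "package": 0.4},
    "pakket": {"package": 1.0},
}

# Hypothetical trigram language model: P(word | two previous words).
# "BOS" marks the beginning of the sentence. All numbers are invented.
TRIGRAM_LM = {
    ("BOS", "BOS", "he"): 0.5,
    ("BOS", "he", "wears"): 0.3,
    ("BOS", "he", "carries"): 0.3,
    ("he", "wears", "a"): 0.4,
    ("he", "carries", "a"): 0.4,
    ("he", "wears", "the"): 0.3,
    ("he", "carries", "the"): 0.3,
    ("wears", "a", "suit"): 0.5,
    ("wears", "a", "package"): 0.01,
    ("carries", "a", "suit"): 0.05,
    ("carries", "a", "package"): 0.4,
    ("wears", "the", "package"): 0.01,
    ("carries", "the", "package"): 0.5,
}

FLOOR = 1e-6  # probability given to unseen trigrams (an assumption of this sketch)


def lm_logprob(words):
    """Log-probability of a target sentence under the toy trigram model."""
    padded = ["BOS", "BOS"] + list(words)
    return sum(
        math.log(TRIGRAM_LM.get(tuple(padded[i - 2:i + 1]), FLOOR))
        for i in range(2, len(padded))
    )


def translate(source_sentence):
    """Return the candidate that maximizes log P(f | e) + log P(e)."""
    source_words = source_sentence.lower().split()
    options = [TRANSLATION_MODEL[word] for word in source_words]
    best, best_score = None, -math.inf
    # Enumerate every word-by-word candidate (feasible only for toy inputs).
    for candidate in itertools.product(*options):
        tm_score = sum(math.log(opts[e]) for opts, e in zip(options, candidate))
        score = tm_score + lm_logprob(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return " ".join(best)


print(translate("Hij draagt een pak"))     # he wears a suit
print(translate("Hij draagt het pakket"))  # he carries the package
```

With these invented numbers the translation model alone cannot choose between "wears" and "carries" for "draagt"; the language model decides, because it has seen "wears a suit" and "carries the package" more often. The trigram window also hints at the "longer span dependencies" problem noted on the Plus/Minus slide: once the verb and its object are more than two words apart, this model no longer sees them together.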