The AVENUE Project Data Elicitation System Lori Levin Language Technologies Institute

The AVENUE Project Data
Elicitation System
Lori Levin
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Joint work with
• Dr. Jeff Good
• Dr. Robert Frederking
• Alison Alvarez
Outline
• The AVENUE MT project
– Including a list of languages we have worked on
• The elicitation tool
– Including which kinds of fonts it works for
• The elicitation corpus
– Including which languages it has been translated into
• Tools for building and revising elicitation
corpora
MT Approaches
• Interlingua: introduce-self
– Reached from the source by syntactic parsing and semantic analysis
– Mapped to the target by sentence planning and text generation
• Transfer rules: Pronoun-acc-1-sg chiamare-1sg N → [np poss-1sg “name”] BE-pres N
– AVENUE: automate rule learning
• Direct: SMT, EBMT
• Source: Mi chiamo Lori
• Target: My name is Lori
AVENUE Machine Translation System
A transfer rule contains:
• Type information: NP::NP
• Synchronous context-free rule: [DET ADJ N] -> [DET N DET ADJ]
• Alignments: (X1::Y1), (X1::Y3), (X2::Y4), (X3::Y2)
• x-side constraints, e.g. ((X1 AGR) = *3-SING)
• y-side constraints, e.g. ((Y1 DEF) = *DEF)
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
Example rule:
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst
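As a rough illustration of how the alignments in the NP rule above drive reordering, the following Python sketch copies each source constituent into the target slots it is aligned to. The names `TransferRule`, `apply_rule`, and the toy `LEXICON` are hypothetical and introduced only for this sketch; they are not part of the AVENUE system, and the constraint equations (agreement, definiteness) are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class TransferRule:
    src_rhs: list[str]                 # source-side categories X1..Xn
    tgt_rhs: list[str]                 # target-side categories Y1..Ym
    alignments: list[tuple[int, int]]  # (i, j) stands for Xi::Yj, 1-based


# The NP rule from the slide: [DET ADJ N] -> [DET N DET ADJ]
NP_RULE = TransferRule(
    src_rhs=["DET", "ADJ", "N"],
    tgt_rhs=["DET", "N", "DET", "ADJ"],
    alignments=[(1, 1), (1, 3), (2, 4), (3, 2)],
)

# Toy lexicon (assumption): transliterations taken from the rule's comment line.
LEXICON = {"the": "ha", "old": "zaqen", "man": "ish"}


def apply_rule(rule, src_words, lexicon):
    """Fill each target slot Yj with the translation of its aligned Xi."""
    if len(src_words) != len(rule.src_rhs):
        raise ValueError("source phrase does not match the rule's source side")
    target = [None] * len(rule.tgt_rhs)
    for x, y in rule.alignments:
        target[y - 1] = lexicon[src_words[x - 1]]
    return target


print(apply_rule(NP_RULE, ["the", "old", "man"], LEXICON))
# → ['ha', 'ish', 'ha', 'zaqen']
```

Because X1 is aligned to both Y1 and Y3, the determiner is copied twice, which is how the sketch reproduces the doubled definite marker in "ha-ish ha-zaqen".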
AVENUE
• Rules can be written by hand or learned
automatically.
• Hybrid
– Rule-based transfer
– Statistical decoder
– Multi-engine combinations with SMT and EBMT
AVENUE systems
(Small and experimental, but tested on unseen data)
• Hebrew-to-English
– Alon Lavie, Shuly Wintner, Katharina Probst
– Hand-written and automatically learned
– Automatic rules trained on 120 sentences perform
slightly better than about 20 hand-written rules.
• Hindi-to-English
– Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
– Automatically learned
– Performs better than SMT when training data is limited
to 50K words
AVENUE systems
(Small and experimental, but tested on unseen data)
• English-to-Spanish
– Ariadna Font Llitjos
– Hand-written, automatically corrected
• Mapudungun-to-Spanish
– Roberto Aranovich and Christian Monson
– Hand-written
• Dutch-to-English
– Simon Zwarts
– Hand-written
Outline
• The AVENUE MT project
• The elicitation tool
• The questionnaire
• Tools for building questionnaires
Elicitation
• Get data from someone who is
– Bilingual
– Literate
• With consistent spelling
– Not experienced with linguistics
English-Hindi Example
Elicitation Tool: Erik Peterson
English-Chinese Example
Note: Translator has to insert
spaces between words in
Chinese.
English-Arabic Example
Outline
• The AVENUE MT project
• The elicitation tool
• The elicitation corpus
• Tools for building elicitation corpora
Size of Questionnaire
• Around 3200 sentences
• 20K words
EC Sample: clause level
• Mary is writing a book for John.
• Who let him eat the sandwich?
• Who had the machine crush the car?
• They did not make the policeman run.
• Mary had not blinked.
• The policewoman was willing to chase the boy.
• Our brothers did not destroy files.
• He said that there is not a manual.
• The teacher who wrote a textbook left.
• The policeman chased the man who was a thief.
• Mary began to work.
Phenomena covered:
• Tense, aspect, transitivity, animacy
• Questions, causation and permission
• Interaction of lexical and grammatical aspect
• Volitionality
• Embedded clauses and sequence of tense
• Relative clauses
• Phase aspect
EC Sample: noun phrase level
• The man quit in November.
• The man works in the afternoon.
• The balloon floated over the library.
• The man walked over the platform.
• The man came out from among the group of boys.
• The long weekly meeting ended.
• The large bus to the post office broke down.
• The second man laughed.
• All five boys laughed.
Phenomena covered:
• Temporal and locative meanings
• Quantifiers
• Numbers
• Combinations of different types of modifiers
– My book
• Possession, definiteness
– A book of mine
• Possession, indefiniteness
Organization into Minimal Pairs
srcsent: Tú caíste.
tgtsent: Eymi ütrünagimi.
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell
srcsent: Tú estás cayendo.
tgtsent: Eymi petu ütrünagimi.
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
srcsent: Tú caíste.
tgtsent: Eymi ütrünagimi.
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
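The record layout above (srcsent, tgtsent, aligned, context, comment) is regular enough to read mechanically. The following Python sketch shows one possible way to load such records; the function names `parse_alignment` and `parse_records` are hypothetical, and the assumption that every record starts with a `srcsent:` line is inferred from the examples on this slide rather than from any format specification.

```python
import re

SAMPLE = """\
srcsent: Tú caíste.
tgtsent: Eymi ütrünagimi.
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell
srcsent: Tú estás cayendo.
tgtsent: Eymi petu ütrünagimi.
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
"""


def parse_alignment(text):
    """Turn '((1,1),(2 3,2 3))' into [((1,), (1,)), ((2, 3), (2, 3))]."""
    pairs = re.findall(r"\(([\d ]+),([\d ]+)\)", text)
    return [
        (tuple(map(int, src.split())), tuple(map(int, tgt.split())))
        for src, tgt in pairs
    ]


def parse_records(text):
    """Split 'key: value' lines into one dict per elicited sentence."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Assumption: a 'srcsent' line opens a new record.
        if key == "srcsent" or not records:
            records.append({})
        records[-1][key] = parse_alignment(value) if key == "aligned" else value
    return records


records = parse_records(SAMPLE)
print(records[1]["tgtsent"], records[1]["aligned"])
```

Keeping each alignment as a pair of index tuples preserves many-to-many links such as `(2 3,2 3)`, where two source words jointly correspond to two target words.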
Feature Detection: Spanish
The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo
A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo
I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo
I saw a red book.
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi un libro rojo
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: no
Marked-on-dependent: yes
Marked-on-governor: no
Marked-on-other: no
Add/delete-word: no
Change-in-alignment: no
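To make the last two rows of the table above concrete, the following Python sketch compares the two translations of a minimal pair and reports whether a word was added or deleted and whether the alignment changed. This is only an illustrative approximation under simple assumptions (whitespace tokenization, alignments compared as sets of index pairs); the function names are hypothetical and this is not the AVENUE feature detection algorithm.

```python
import re


def parse_pairs(text):
    """Parse '((1,1)(2,2)(5,6))' into a set of (source, target) index pairs."""
    return {(int(s), int(t)) for s, t in re.findall(r"\((\d+),\s*(\d+)\)", text)}


def compare_minimal_pair(tgt_a, align_a, tgt_b, align_b):
    """Report surface differences between two translations of a minimal pair."""
    tokens_a, tokens_b = tgt_a.split(), tgt_b.split()
    same_length = len(tokens_a) == len(tokens_b)
    changed = None
    if same_length:
        # 1-based positions whose word form differs between the two sentences
        changed = [
            i
            for i, (a, b) in enumerate(zip(tokens_a, tokens_b), start=1)
            if a != b
        ]
    return {
        "add_delete_word": not same_length,
        "change_in_alignment": parse_pairs(align_a) != parse_pairs(align_b),
        "changed_positions": changed,
    }


# "The girl saw a red book." vs. "A girl saw a red book." in Spanish
result = compare_minimal_pair(
    "La niña vió un libro rojo", "((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))",
    "Una niña vió un libro rojo", "((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))",
)
print(result)
# → {'add_delete_word': False, 'change_in_alignment': False, 'changed_positions': [1]}
```

For the Spanish pair, only the word in position 1 (the determiner) changes, which is consistent with definiteness being marked on a dependent without adding words or changing the alignment.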
Feature Detection: Chinese
A girl saw a red book.
((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
有 一个 女人 看见 了 一本 红色 的 书 。
The girl saw a red book.
((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
女人 看见 了 一本 红色的 书
Feature: definiteness
Values: definite, indefinite
Function-of-*: subject
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: no
Feature Detection: Chinese
I saw the red book
((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
红色的 书, 我 看见 了
I saw a red book.
((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
我 看见 了 一本 红色的 书 。
Feature: definiteness
Values: definite, indefinite
Function-of-*: object
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: yes
Feature Detection: Hebrew
A girl saw a red book.
((2,1) (3,2)(5,4)(6,3))
‫ילדה ראתה ספר אדום‬
The girl saw a red book
((1,1)(2,1)(3,2)(5,4)(6,3))
‫הילדה ראתה ספר אדום‬
I saw a red book.
((2,1)(4,3)(5,2))
‫ראיתי ספר אדום‬
I saw the red book.
((2,1)(3,3)(3,4)(4,4)(5,3))
‫ראיתי את הספר האדום‬
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: yes
Marked-on-dependent: yes
Marked-on-governor: no
Add/delete-word: no
Change-in-alignment: no
Feature Detection Feeds into…
• Corpus Navigation: which minimal pairs to pursue next.
– Don’t pursue gender in Mapudungun
– Do pursue definiteness in Hebrew
• Morphology Learning:
– Morphological learner identifies the forms of the morphemes
– Feature detection identifies the functions
• Rule learning:
– Rule learner will have to learn a constraint for each morphosyntactic marker that is discovered
• E.g., Adjectives and nouns agree in gender, number, and definiteness
in Hebrew.
Languages
• The set of feature structures with English
sentences has been delivered to the Linguistic
Data Consortium as part of the Reflex program.
• Translated (by LDC) into:
– Thai
– Bengali
• Plans to translate into:
– Seven “strategic” languages per year for five years.
• As one small part of a language pack (BLARK) for each
language.
Languages
• Spanish version in progress at New Mexico
State University (Helmreich and Cowie)
– Plans to translate into Guarani
• Portuguese version in progress in Brazil
(Marcello Modesto)
– Plans to translate into Karitiana
• 200 speakers
• Plans to translate into Inupiaq (Kaplan and
MacLean)
Previous Elicitation Work
• Pilot corpus
– Around 900 sentences
– No feature structures
• Mapudungun
– Two partial translations
• Quechua
– Three translations
• Aymara
– Seven translations
• Hebrew
• Hindi
– Several translations
• Dutch
Feature Structures
• The EC is actually a corpus of feature
structures that happen to have English or
Spanish sentences attached to them.
Bengali example with feature structure
srcsent: The large bus to the post office broke down.
context:
tgtsent:
((actor ((modifier ((mod-role mod-descriptor)
(mod-role role-loc-general-to)))
(np-identifiability identifiable)(np-specificity specific)
(np-biological-gender bio-gender-n/a)(np-animacy anim-inanimate)
(np-person person-third)(np-function fn-actor)
(np-general-type common-noun-type)(np-number num-sg)
(np-pronoun-exclusivity inclusivity-n/a)(np-pronoun-antecedent antecedent-n/a)
(np-distance distance-neutral)))
(c-general-type declarative-clause)(c-my-causer-intentionality intentionality-n/a)
(c-comparison-type comparison-n/a)(c-relative-tense relative-n/a)
(c-our-boundary boundary-n/a)(c-comparator-function comparator-n/a)
(c-causee-control control-n/a)(c-our-situations situations-n/a)
(c-comparand-type comparand-n/a)(c-causation-directness directness-n/a)
(c-source source-neutral)(c-causee-volitionality volition-n/a)
(c-assertiveness assertiveness-neutral)(c-solidarity solidarity-neutral)
(c-polarity polarity-positive)(c-v-grammatical-aspect gram-aspect-neutral)
(c-adjunct-clause-type adjunct-clause-type-n/a)(c-v-phase-aspect phase-aspect-neutral)
(c-v-lexical-aspect activity-accomplishment)(c-secondary-type secondary-neutral)
(c-event-modality event-modality-none)(c-function fn-main-clause)
(c-minor-type minor-n/a)(c-copula-type copula-n/a)
(c-v-absolute-tense past)(c-power-relationship power-peer)
(c-our-shared-subject shared-subject-n/a)(c-question-gap gap-n/a))
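Feature structures written in this parenthesized notation can be read with a very small recursive parser. The Python sketch below is one possible way to do it, shown on a shortened sample; `parse_sexpr` and `clause_features` are hypothetical helper names, not tools from the AVENUE project, and the parser assumes balanced parentheses.

```python
import re


def parse_sexpr(text):
    """Parse a parenthesized feature structure into nested Python lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ')'
            return items
        return token

    return read()


def clause_features(fs):
    """Collect top-level (feature value) pairs, e.g. (c-polarity polarity-positive)."""
    return {
        item[0]: item[1]
        for item in fs
        if isinstance(item, list)
        and len(item) == 2
        and all(isinstance(part, str) for part in item)
    }


# Shortened sample in the same notation as the slide (not the full structure).
SAMPLE = """((actor ((np-number num-sg)(np-person person-third)))
 (c-polarity polarity-positive)(c-v-absolute-tense past))"""

fs = parse_sexpr(SAMPLE)
print(clause_features(fs))
# → {'c-polarity': 'polarity-positive', 'c-v-absolute-tense': 'past'}
```

Nested entries such as `actor` are skipped by `clause_features` because their second element is a list rather than a plain value; they remain available in the parsed structure for separate handling.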
Why feature structures?
• Decide what grammatical meaning to
elicit.
• Represent it in a feature structure.
• Formulate an English or Spanish sentence
that expresses that meaning.
– We can use the same corpus of feature
structures for several elicitation languages
• Have the informant translate it.
Grammatical meanings vs syntactic
categories
• Features and values are based on a
collection of grammatical meanings
– Many of which are similar to the
grammatemes of the Prague Treebanks
Grammatical Meanings
YES
• Semantic Roles
• Identifiability
• Specificity
• Time
– Before, after, or during
time of speech
• Modality
NO
• Case
• Voice
• Determiners
• Auxiliary verbs
Grammatical Meanings
YES
• How is identifiability expressed?
– Determiner
– Word order
– Optional case marker
– Optional verb agreement
• How is specificity expressed?
• How are generics expressed?
• How are predicate nominals marked?
NO
• How are English determiners translated?
– The boy cried.
– The lion is a fierce beast.
– I ate a sandwich.
– He is a soldier.
• Il est soldat.
Argument Roles
• Actor
• Undergoer
• Predicate and predicatee
– The woman is the manager.
• Recipient
– I gave a book to the students.
• Beneficiary
– I made a phone call for Sam.
Why not subject and object?
• Languages use their voice systems for different
purposes.
• Mapudungun obligatorily uses an inverse marked verb
when third person acts on first or second person.
– Verb agrees with undergoer
– Undergoer exhibits other subjecthood properties
– Actor may be object.
• Yes: How are actor and undergoer encoded in
combination with other semantic features like adversity
(Japanese) and person (Mapudungun)?
• No: How is English voice translated into another
language?
Argument Roles
• Accompaniment
– With someone
– With pleasure
• Material
– (out) of wood
• About 20 more roles
– From the Lingua checklist; Comrie & Smith (1977)
– Many also found in tectogrammatical representations in the
Prague Treebanks
• Around 80 locative relations
– From Lingua checklist
• Many temporal relations
Noun Phrase Features
• Person
• Number
• Biological gender
• Animacy
• Distance (for deictics)
• Identifiability
• Specificity
• Possession
• Other semantic roles
– Accompaniment, material, location, time, etc.
• Type
– Proper, common, pronoun
• Cardinals
• Ordinals
• Quantifiers
• Given and new information
– Not used yet because of limited context in the elicitation tool.
Clause level features
• Tense
• Aspect
– Lexical, grammatical,
phase
• Type
– Declarative, open-q,
yes-no-q
• Function
– Main, argument,
adjunct, relative
• Source
– Hearsay, first-hand,
sensory, assumed
• Assertedness
– Asserted,
presupposed, wanted
• Modality
– Permission, obligation
– Internal, external
Other clause types
(Constructions)
• Causative
– Make/let/have someone do something
• Predication
– May be expressed with or without an overt copula.
• Existential
– There is a problem.
• Impersonal
– One doesn’t smoke in restaurants in the US.
• Lament
– If only I had read the paper.
• Conditional
• Comparative
• Etc.
Outline
• The AVENUE MT project
• The elicitation tool
• The elicitation corpus
• Tools for elicitation corpora
Tools for Creating Elicitation Corpora
(Pipeline diagram, Mar 1, 2006; stages in order)
• Feature Specification: list of semantic features and values
– Clause level, noun phrase, tense & aspect, modality, …
– Supporting tools: XML Schema, XSLT script
• Feature Maps: which combinations of features and values are of interest
– Supporting tool: combination formalism
• Feature Structure Sets
– Supporting tool: feature structure viewer
• Reverse Annotated Feature Structure Sets: add English sentences
• The Corpus
• Sampling
• Smaller Corpus
Feature Specification
• Defines Features and their values
• Sets default values for features
• Specifies feature requirements and
restrictions
• Written in XML
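Restrictions of this kind can be read as simple checks on a feature structure. The following Python sketch encodes the restrictions that the feature specification lists for c-copula-type (shown on the next slide) and tests a flat dictionary of features against them. The table `COPULA_TYPE_RESTRICTIONS` and the function `satisfies_restriction` are hypothetical names for this sketch; the actual specification is written in XML, and its format is not reproduced here.

```python
SECONDARY_COPULA = ("c-secondary-type", "secondary-copula")

# value of c-copula-type -> (restricted pair, whether that pair must be present)
# "False" encodes the negated restriction ~(c-secondary-type secondary-copula).
COPULA_TYPE_RESTRICTIONS = {
    "copula-n/a": (SECONDARY_COPULA, False),
    "copula-role": (SECONDARY_COPULA, True),
    "copula-identity": (SECONDARY_COPULA, True),
    "copula-location": (SECONDARY_COPULA, True),
    "copula-description": (SECONDARY_COPULA, True),
}


def satisfies_restriction(features):
    """Check the c-copula-type restriction for a flat dict of clause features."""
    value = features["c-copula-type"]
    (feature, required_value), must_hold = COPULA_TYPE_RESTRICTIONS[value]
    holds = features.get(feature) == required_value
    return holds == must_hold


print(satisfies_restriction(
    {"c-copula-type": "copula-role", "c-secondary-type": "secondary-copula"}
))  # → True
print(satisfies_restriction(
    {"c-copula-type": "copula-n/a", "c-secondary-type": "secondary-copula"}
))  # → False
```

A check like this could be used to discard inconsistent feature structures before any sentences are written for them, although how the project's own tools apply restrictions is not described in these slides.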
Feature Specification
Feature: c-copula-type
(a copula is a verb like “be”; some languages do not have copulas)
Values
copula-n/a
Restrictions: 1. ~(c-secondary-type secondary-copula)
Notes:
copula-role
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. A role is something like a job or a function. "He is a
teacher"
"This is a vegetable peeler"
copula-identity
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. "Clark Kent is Superman" "Sam is the teacher"
copula-location
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. "The book is on the table" There is a long list of locative
relations later in the feature specification.
copula-description
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. A description is an attribute. "The children are happy."
"The books are long."
Feature Maps
• Some features interact in the grammar
– English –s reflects person and number of the subject and tense of
the verb.
– In expressing the English present progressive tense, the auxiliary
verb is in a different place in a question and a statement:
• He is running.
• Is he running?
• We need to check many, but not all, combinations of
features and values.
• Using unlimited feature combinations leads to an
unmanageable number of sentences.
Feature Combination Template
((predicatee
((np-general-type pronoun-type common-noun-type)
(np-person person-first person-second person-third)
(np-number num-sg num-pl)
(np-biological-gender bio-gender-male bio-gender-female)))
{[(predicate ((np-general-type common-noun-type)
(np-person person-third)))
(c-copula-type role)]
[(predicate ((adj-general-type quality-type)
(c-copula-type attributive)))]
[(predicate ((np-general-type common-noun-type)
(np-person person-third)
(c-copula-type identity)))]}
(c-secondary-type secondary-copula) (c-polarity #all)
(c-general-type declarative)
(c-speech-act sp-act-state)
(c-v-grammatical-aspect gram-aspect-neutral)
(c-v-lexical-aspect state)
(c-v-absolute-tense past present future)
(c-v-phase-aspect durative))
Summarizes 288 feature
structures, which are
automatically generated.
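The slides do not spell out how the figure of 288 is reached. The Python sketch below shows one reading of the template that reproduces it: multi-valued features are expanded as a cross product, the three bracketed alternatives count as three options, `#all` is assumed to expand to two polarity values (positive and negative), and common-noun predicatees are assumed to be restricted to third person. Both of the last two points are assumptions made for this sketch, not statements from the source.

```python
from itertools import product

NP_TYPES = ["pronoun-type", "common-noun-type"]
PERSONS = ["person-first", "person-second", "person-third"]
NUMBERS = ["num-sg", "num-pl"]
GENDERS = ["bio-gender-male", "bio-gender-female"]
COPULA_ALTERNATIVES = ["role", "attributive", "identity"]  # the three [...] options
POLARITIES = ["polarity-positive", "polarity-negative"]    # assumed meaning of #all
TENSES = ["past", "present", "future"]


def allowed(np_type, person):
    # Assumed restriction: common nouns are always third person.
    return np_type != "common-noun-type" or person == "person-third"


predicatees = [
    combo
    for combo in product(NP_TYPES, PERSONS, NUMBERS, GENDERS)
    if allowed(combo[0], combo[1])
]
structures = list(product(predicatees, COPULA_ALTERNATIVES, POLARITIES, TENSES))

unrestricted = (
    len(NP_TYPES) * len(PERSONS) * len(NUMBERS) * len(GENDERS)
    * len(COPULA_ALTERNATIVES) * len(POLARITIES) * len(TENSES)
)

print(len(predicatees), len(structures), unrestricted)
# → 16 288 432
```

Under these assumptions there are 16 admissible predicatees (12 pronoun combinations plus 4 common-noun combinations), and 16 × 3 × 2 × 3 = 288. Without the person restriction the same template would expand to 432 structures, which illustrates why restrictions matter for keeping the corpus manageable.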
Adding Sentences to Feature
Structures
srcsent: Mary was not a leader.
context: Translate this as though it were spoken to a peer coworker;
((actor ((np-function fn-actor)(np-animacy anim-human)(np-biological-gender bio-gender-female)
(np-general-type proper-noun-type)(np-identifiability identifiable)(np-specificity specific)…))
(pred ((np-function fn-predicate-nominal)(np-animacy anim-human)(np-biological-gender bio-gender-female)
(np-general-type common-noun-type)(np-specificity specificity-neutral)…))
(c-v-lexical-aspect state)(c-copula-type copula-role)(c-secondary-type secondary-copula)
(c-solidarity solidarity-neutral)(c-v-grammatical-aspect gram-aspect-neutral)
(c-v-absolute-tense past)(c-v-phase-aspect phase-aspect-neutral)
(c-general-type declarative-clause)(c-polarity polarity-negative)
(c-my-causer-intentionality intentionality-n/a)(c-comparison-type comparison-n/a)
(c-relative-tense relative-n/a)(c-our-boundary boundary-n/a)…)
Difficult Issues in Adding Sentences
• Have to remember that the grammatical
meanings don’t correspond exactly to English
morphemes.
– Identifiability and specificity vs. “the” and “a”
– Modality, tense, aspect vs. auxiliary verbs
• The meaning has to be clear to a translator.
– If English is going to be the source language for
translation, the clearest way to say something may
not be the most common way it is said in real text or
conversation.
Hard Problems
• Expressing meanings that are not
grammaticalized in English.
– Evidentiality:
• He stole the bread.
• Context: Translate this as if you do not
have first hand knowledge. In English, we
might say, “They say that he stole the
bread” or “I hear that he stole the bread.”
Hard Problems
• Reverse annotating things that can be said
in several ways in English.
– Impersonals:
• One doesn’t smoke here.
• You don’t smoke here.
• They don’t smoke here.
• There’s no smoking here.
• Credit cards aren’t accepted.
– Problem in the Reflex corpus because space was limited.
Evaluation
• Current funding has not covered
evaluation of the questionnaire.
– Except for informal observations as it was
translated into several languages.
• Does it elicit the meanings it was intended
to elicit?
– Informal observation: usually
• Is it useful for machine translation?
Navigation
• Currently, feature combinations are specified by
a human.
• Plan to work in active learning mode.
– Build seed questionnaire
– Translate some data
– Do some learning
– Identify most valuable pieces of information to get next
– Generate an RTB for those pieces of information
– Translate more
– Learn more
– Generate more, etc.
Summary
• Feature Specification:
– lists features and values
– Grammatical meanings
• Feature Combinations
• Set of Feature Structures
• Add English or Spanish Sentences
• Get a translation and word alignment from a bilingual, literate informant