Hindi Generation from Interlingua (UNL)
Om P. Damani, IIT Bombay
(Joint work with S. Singh, M. Dalal, V. Vachhani, P. Bhattacharya)
Cross Lingual Information Access: Language Independent Representation
System Architecture for Achieving Cross Lingual Information Access
[Figure: Query in Hindi, English, or Marathi -> Enconvertor -> UNL Expression -> Search Module over UNL Documents -> matching UNL Documents -> Output Generator -> Hindi Deconvertor -> Results in Hindi, and Marathi Deconvertor -> Results in Marathi]
Universal Networking Language (UNL)
[Figure: UNL graph with entry node eat(icl>do).@entry.@present and three edges: agt to John(iof>person), obj to rice(icl>food), ins to spoon(icl>artifact).@indef]
[UNL:1]
agt(eat(icl>do).@entry.@present, John(iof>person))
obj(eat(icl>do).@entry.@present, rice(icl>food))
ins(eat(icl>do).@entry.@present, spoon(icl>artifact))
[\UNL]
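The expression above can be read into a small graph structure. The following is a minimal sketch, not the HinD parser: the function names (`split_top_level`, `parse_node`, `parse_relation`, `build_graph`) are hypothetical, and it assumes every line has the form `relation(head, dependent)` with attributes written as `.@name`.

```python
import re
from collections import defaultdict


def split_top_level(text):
    """Split 'head, dependent' at the first comma that is not inside parentheses."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return text[:i].strip(), text[i + 1:].strip()
    raise ValueError(f"no top-level comma in {text!r}")


def parse_node(text):
    """Separate a node into its universal word and its set of @attributes."""
    attrs = set(re.findall(r"\.(@\w+)", text))
    word = re.sub(r"\.@\w+", "", text)
    return word, attrs


def parse_relation(line):
    """Parse 'rel(head, dependent)' into (rel, (head, attrs), (dependent, attrs))."""
    match = re.fullmatch(r"(\w+)\((.*)\)", line.strip())
    if match is None:
        raise ValueError(f"not a relation: {line!r}")
    head, dependent = split_top_level(match.group(2))
    return match.group(1), parse_node(head), parse_node(dependent)


def build_graph(lines):
    """Map each head word to a {relation: dependent word} dictionary."""
    graph = defaultdict(dict)
    for line in lines:
        relation, (head, _), (dependent, _) = parse_relation(line)
        graph[head][relation] = dependent
    return dict(graph)


UNL_LINES = [
    "agt(eat(icl>do).@entry.@present, John(iof>person))",
    "obj(eat(icl>do).@entry.@present, rice(icl>food))",
    "ins(eat(icl>do).@entry.@present, spoon(icl>artifact))",
]

graph = build_graph(UNL_LINES)
print(graph)
```

The depth counter in `split_top_level` matters once universal words carry nested restrictions containing commas, as in `contact(icl>communicate(agt>person,obj>person))`.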
UNL Scopes: Representing Embeddings
[Figure: UNL graph of the example. contact has obj farmer, agt you and pur this; farmer has plc pointing to scope :01; inside scope :01, region and taluka are joined by or, with nam Manchar on region and nam Khatav on taluka]
obj(contact(icl>communicate(agt>person,obj>person)):0W.@imperative.@entry, farmer(icl>creator):1T.@pl.@def)
agt(contact(icl>communicate(agt>person,obj>person)):0W.@imperative.@entry, you(icl>persons):0J)
pur(contact(icl>communicate(agt>person,obj>person)):0W.@imperative.@entry, this:04)
plc(farmer(icl>creator):1T.@pl.@def, :01)
or:01(region(icl>location):38.@entry, taluka(icl>geographical area):4A)
nam:01(region(icl>location):38.@entry, Manchar(icl>geographical place):2R)
nam:01(taluka(icl>geographical area):4A, Khatav(icl>geographical area):3U)
For this, you contact the farmers of Manchar region or of Khatav taluka.
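To make the role of the scope label concrete, here is a small sketch that groups relations by scope and finds the relation that points into a scope. It is an illustration under assumptions: the helper names are hypothetical, the label `main` for unscoped relations is invented here, and the relation strings are the ones from this example.

```python
import re
from collections import defaultdict

# relation label, optional ':scope' suffix, then the parenthesised arguments
RELATION = re.compile(r"(\w+)(?::(\w+))?\((.*)\)")

LINES = [
    "obj(contact(icl>communicate(agt>person,obj>person)):0W.@imperative.@entry, farmer(icl>creator):1T.@pl.@def)",
    "agt(contact(icl>communicate(agt>person,obj>person)):0W.@imperative.@entry, you(icl>persons):0J)",
    "pur(contact(icl>communicate(agt>person,obj>person)):0W.@imperative.@entry, this:04)",
    "plc(farmer(icl>creator):1T.@pl.@def, :01)",
    "or:01(region(icl>location):38.@entry, taluka(icl>geographical area):4A)",
    "nam:01(region(icl>location):38.@entry, Manchar(icl>geographical place):2R)",
    "nam:01(taluka(icl>geographical area):4A, Khatav(icl>geographical area):3U)",
]


def group_by_scope(lines):
    """Return {scope id: [relation labels]}; unscoped relations go under 'main'."""
    scopes = defaultdict(list)
    for line in lines:
        match = RELATION.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"not a relation: {line!r}")
        relation, scope, _ = match.groups()
        scopes[scope or "main"].append(relation)
    return dict(scopes)


def scope_references(lines):
    """Return (relation, scope id) pairs whose second argument is a bare scope like ':01'."""
    refs = []
    for line in lines:
        relation, _, args = RELATION.fullmatch(line.strip()).groups()
        match = re.search(r",\s*:(\w+)$", args)
        if match:
            refs.append((relation, match.group(1)))
    return refs


print(group_by_scope(LINES))
print(scope_references(LINES))
```

Here the `plc` relation treats the whole scope `:01` as a single node, which is how the embedded phrase "Manchar region or Khatav taluka" attaches to farmer.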
HinD (Hindi Deconverter): Yet Another Deconversion Engine
[Figure: Manati, Deco and HinD, with axes labelled Output Quality and Simplicity of Specification; illustrated with excerpts of the rule tables shown on later slides: the relation ordering matrix over agt, aoj, obj and ins, the noun suffix table, and the verb suffix table]
HinD Architecture
Deconversion = Transfer + Generation
Step-through the Deconverter
[Figure: UNL graph of the example sentence: contact with obj farmer, agt you and pur this; farmer with plc :01; scope :01 with region or taluka, nam manchar and nam khatav]

Module: UNL Expression
Output: obj(contact(…

Module: Lexeme Selection
Output: संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
        contact farmer this you region taluka manchar khatav

Module: Case Identification
Output: संपर्क किसान* यह आप क्षेत्र* तालुका* मंचर खटाव
        contact farmer* this you region* taluka* manchar khatav

Module: Morphology Generation
Output: संपर्क कीजिए किसानों इस आप क्षेत्र तालुके मंचर खटाव
        contact.@imperative farmer.@pl this you region taluka manchar Khatav

Module: Function Word Insertion
Output: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुके के मंचर खटाव
        contact farmers this for you region or taluka of Manchar Khatav

Module: Linearization
Output: इसके लिए आप मंचर क्षेत्र या खटाव तालुके के किसानों को संपर्क कीजिए ।
        This for you manchar region or khatav taluka of farmers contact
Lexeme Selection
[संपर्क]{}"contact(icl>communicate(agt>person,obj>person))" (V,VOA,VOA-ACT,VOA-COMM,VLTN,TMP,CJNCT,N-V,link,Va)
[पहचान का व्यक्ति]{}"contact(icl>representative)" (N,ANIMT,FAUNA,MML,PRSN,Na)
Lexical Choice is unambiguous
[Figure: graph nodes annotated with the selected lexemes: contact संपर्क, farmer (obj) किसान, this (pur) यह, you (agt) आप]
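Because each universal word carries its own restrictions, lexeme selection can be pictured as an exact-match dictionary lookup. The sketch below assumes a plain in-memory dictionary. The name `LEXICON` and the function `select_lexeme` are hypothetical, and only the two entries shown on the slide are included.

```python
# Hypothetical in-memory lexicon: universal word -> (Hindi lexeme, attribute tuple).
LEXICON = {
    "contact(icl>communicate(agt>person,obj>person))": (
        "संपर्क",
        ("V", "VOA", "VOA-ACT", "VOA-COMM", "VLTN", "TMP", "CJNCT", "N-V", "link", "Va"),
    ),
    "contact(icl>representative)": (
        "पहचान का व्यक्ति",
        ("N", "ANIMT", "FAUNA", "MML", "PRSN", "Na"),
    ),
}


def select_lexeme(universal_word):
    """Return (lexeme, attributes) for a fully restricted universal word."""
    try:
        return LEXICON[universal_word]
    except KeyError:
        raise KeyError(f"no dictionary entry for {universal_word!r}") from None


lexeme, attributes = select_lexeme("contact(icl>communicate(agt>person,obj>person))")
print(lexeme, attributes[0])
```

The headword `contact` alone would be ambiguous between the verb and the noun; the restriction in parentheses is what makes the key unique.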
Case Marking
Relation Parent+ Parent- Child+ Child-
Obj V VINT N
Agt V @past N
• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from the head to its modifiers
[Figure: the example graph after case identification, with farmer (obj) किसान marked *; contact संपर्क, this (pur) यह and you (agt) आप are unmarked]
Morphology Generation: Nouns
The boy saw me.    लड़के ने मुझे देखा ।
Boys saw me.       लड़कों ने मुझे देखा ।
The King saw me.   राजा ने मुझे देखा ।
Kings saw me.      राजाओं ने मुझे देखा ।

Suffix   Attribute values
uoM      @N,@NU,@M,@pl,@oblique
U        @N,@NU,@M,@sg,@oblique
I        @N,@NI,@F,@sg,@oblique
iyoM     @N,@NI,@F,@pl,@oblique
oM       @N,@NA,@NOTCH,@F,@pl,@oblique
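A suffix table of this kind can be applied by checking which row's attribute set is contained in the node's attributes. The following sketch assumes that reading. The pairing of suffixes with attribute rows follows the order in which they appear on the slide, and the names `NOUN_SUFFIX_RULES` and `select_noun_suffix` are hypothetical.

```python
# Each rule: (suffix, attributes the noun must carry). Pairing is an assumption
# based on the row order of the slide's table.
NOUN_SUFFIX_RULES = [
    ("uoM", {"@N", "@NU", "@M", "@pl", "@oblique"}),
    ("U", {"@N", "@NU", "@M", "@sg", "@oblique"}),
    ("I", {"@N", "@NI", "@F", "@sg", "@oblique"}),
    ("iyoM", {"@N", "@NI", "@F", "@pl", "@oblique"}),
    ("oM", {"@N", "@NA", "@NOTCH", "@F", "@pl", "@oblique"}),
]


def select_noun_suffix(attributes):
    """Return the first suffix whose required attributes are all present, else None."""
    for suffix, required in NOUN_SUFFIX_RULES:
        if required <= attributes:  # subset test
            return suffix
    return None


# A feminine, I-ending, plural noun in the oblique case; @def is extra and ignored.
print(select_noun_suffix({"@N", "@NI", "@F", "@pl", "@oblique", "@def"}))
```

Nodes that match no row return `None`, which a caller could treat as "leave the stem unchanged"; that fallback is a design choice of this sketch.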
Verb Morphology
Suffix          Tense     Aspect     Mood      N    Gen      P    VE
-e rahaa thaa   @past     @progress  -         @sg  @male    3rd  e
-taa hai        @present  @custom    -         @sg  @male    3rd  -
-iyaa thaa      @past     @complete  -         @sg  @male    3rd  I
saktii hain     @present  -          @ability  @pl  @female  3rd  A

Example: लड़का किताब दे रहा था (The boy was giving the book)
Suffixes mentioned in the rules get attached to the stems selected from the Stem Dictionary.
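The verb table can be read as feature-matching rules in which a "-" cell places no constraint. The sketch below takes that reading as an assumption. The feature keys (`tense`, `aspect`, `mood`, `number`, `gender`, `person`) and the function name are hypothetical, and the VE column is left out because its role is not spelled out on the slide.

```python
# Each rule: (suffix, required features). Features shown as "-" on the slide
# are simply omitted, so they act as wildcards (an assumption of this sketch).
VERB_SUFFIX_RULES = [
    ("-e rahaa thaa", {"tense": "@past", "aspect": "@progress",
                       "number": "@sg", "gender": "@male", "person": "3rd"}),
    ("-taa hai", {"tense": "@present", "aspect": "@custom",
                  "number": "@sg", "gender": "@male", "person": "3rd"}),
    ("-iyaa thaa", {"tense": "@past", "aspect": "@complete",
                    "number": "@sg", "gender": "@male", "person": "3rd"}),
    ("saktii hain", {"tense": "@present", "mood": "@ability",
                     "number": "@pl", "gender": "@female", "person": "3rd"}),
]


def select_verb_suffix(features):
    """Return the first suffix whose required features all match, else None."""
    for suffix, required in VERB_SUFFIX_RULES:
        if all(features.get(key) == value for key, value in required.items()):
            return suffix
    return None


# Features for "The boy was giving the book".
giving = {"tense": "@past", "aspect": "@progress",
          "number": "@sg", "gender": "@male", "person": "3rd"}
print(select_verb_suffix(giving))
```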
After Morphology Generation
[Figure: the example graph before and after morphology generation. Before: contact संपर्क, farmer (obj) किसान, this (pur) यह, you (agt) आप. After: contact संपर्क कीजिए, farmer (obj) किसानों, this (pur) इस, you (agt) आप]
Function Word Insertion
Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुके मंचर खटाव
After:  संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुके के मंचर खटाव

Rel Par+ Par- Chi+ Chi- Ch/FW
Obj V VINT N#ANIMT @topic को

[Figure: the example graph with function words attached: contact संपर्क कीजिए, farmer (obj) किसानों को, this (pur) इसके लिए, you (agt) आप]
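The rule row above can be sketched as a check on required and forbidden attributes of the parent and child. This is an illustration only. The `Rule` class and `function_word_for` are hypothetical names, `N#ANIMT` is assumed to mean "both N and ANIMT", and the attribute set given to farmer is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    relation: str
    parent_has: frozenset    # Par+
    parent_lacks: frozenset  # Par-
    child_has: frozenset     # Chi+
    child_lacks: frozenset   # Chi-
    function_word: str       # word attached to the child (Ch/FW)


# The single row shown on the slide.
OBJ_KO = Rule("obj", frozenset({"V"}), frozenset({"VINT"}),
              frozenset({"N", "ANIMT"}), frozenset({"@topic"}), "को")


def function_word_for(rule, relation, parent_attrs, child_attrs):
    """Return the rule's function word if every condition holds, else None."""
    if relation != rule.relation:
        return None
    if not rule.parent_has <= parent_attrs or rule.parent_lacks & parent_attrs:
        return None
    if not rule.child_has <= child_attrs or rule.child_lacks & child_attrs:
        return None
    return rule.function_word


contact_attrs = {"V", "VOA", "VOA-ACT", "VOA-COMM", "VLTN", "TMP", "CJNCT", "N-V", "link", "Va"}
farmer_attrs = {"N", "ANIMT", "@pl", "@def"}  # hypothetical attributes for farmer

print(function_word_for(OBJ_KO, "obj", contact_attrs, farmer_attrs))
```

The same four-column pattern (Parent+, Parent-, Child+, Child-) appears in the case marking table, so one rule matcher could plausibly serve both stages.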
Linearization
[Figure: UNL graph of the example sentence, traversed to produce the word order below]
इस आप मंचर क्षेत्र खटाव तालुके किसानों संपर्क कीजिए
This you manchar region khatav taluka farmers contact
Syntax Planning: Assumptions

The relative word order of a UNL relation's relata does not depend on:
• Semantic Independence: the semantic properties of the relata.
• Context Independence: the rest of the expression.
The relative word order of various relations sharing a relatum does not depend on:
• Local Ordering: the rest of the expression.
[Figure: example graphs: the single relation pur(contact, this), and contact with its obj farmer, agt you and pur this]
Syntax Planning: Strategy
• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topo Sort each group: pur agt obj, giving this you farmer
• Final order: this you farmer contact
[Figure: the example graph (contact with agt you, obj farmer and pur this, plus the region, taluka, manchar and khatav nodes), alongside a pairwise ordering matrix over agt, aoj, obj and ins whose cells are marked L or R]
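The ordering step for a single node can be sketched with a topological sort over pairwise precedence constraints. This is a minimal illustration, not HinD's rule base. `PRECEDES` lists only the two pairs needed to reproduce this example, whereas a full pairwise matrix would specify every pair. The function names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical precedence pairs (left, right): 'left' is placed before 'right'.
PRECEDES = [("pur", "agt"), ("agt", "obj")]


def order_relations(relations):
    """Topologically sort the relations present on a node by the precedence pairs."""
    present = set(relations)
    sorter = TopologicalSorter()
    for relation in relations:
        sorter.add(relation)  # register the node even if it has no constraints
    for left, right in PRECEDES:
        if left in present and right in present:
            sorter.add(right, left)  # 'left' must come before 'right'
    return list(sorter.static_order())


def linearize_node(head, before_children):
    """Place all Before_set children, in sorted order, ahead of the head."""
    ordered = order_relations(list(before_children))
    return [before_children[relation] for relation in ordered] + [head]


children = {"obj": "farmer", "agt": "you", "pur": "this"}
print(linearize_node("contact", children))  # ['this', 'you', 'farmer', 'contact']
```

With After_set empty, every child precedes the head, which matches the verb-final order of the Hindi output.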
Syntax Planning Algo
[Figure: step-by-step trace of the algorithm on the example graph, shown as a table with columns Stack, Before Current, After Current and Output]
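A stack-based traversal of this kind can be sketched as follows. It is an illustration under assumptions rather than the HinD algorithm itself. Each node's children are given already split into before and after lists in sorted order, the dependent of `or` is assumed to go after its head, and function words are left out.

```python
# node -> (children placed before the node, children placed after the node)
CHILDREN = {
    "contact": (["this", "you", "farmer"], []),
    "farmer": (["region"], []),             # plc points at scope :01, entered at region
    "region": (["manchar"], ["taluka"]),    # nam before; or-dependent after (assumption)
    "taluka": (["khatav"], []),
}


def linearize(root, children):
    """Emit nodes so that before-children precede, and after-children follow, each node."""
    output = []
    stack = [(root, False)]  # (node, whether its children were already pushed)
    while stack:
        node, expanded = stack.pop()
        if expanded:
            output.append(node)
            continue
        before, after = children.get(node, ([], []))
        # Push in reverse of the desired emission order: the stack is last-in, first-out.
        for child in reversed(after):
            stack.append((child, False))
        stack.append((node, True))
        for child in reversed(before):
            stack.append((child, False))
    return output


order = linearize("contact", CHILDREN)
print(" ".join(order))  # this you manchar region khatav taluka farmer contact
```

The result agrees with the content-word order shown on the Linearization slide: this you manchar region, khatav taluka farmers, contact.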
All Together (UNL -> Hindi)
Module: UNL Expression
Output: obj(contact(..

Module: Lexeme Selection
Output: संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
        contact farmer this you region taluka manchar khatav

Module: Case Identification
Output: संपर्क किसान* यह आप क्षेत्र* तालुका* मंचर खटाव
        contact farmer* this you region* taluka* manchar khatav

Module: Morphology Generation
Output: संपर्क कीजिए किसानों यह आप क्षेत्र तालुके मंचर खटाव
        contact.@imperative farmer.@pl this you region taluka manchar Khatav

Module: Function Word Insertion
Output: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुके के मंचर खटाव
        contact farmers this for you region or taluka of Manchar Khatav

Module: Syntax Linearization
Output: इसके लिए आप मंचर क्षेत्र या खटाव तालुके के किसानों को संपर्क कीजिए ।
        This for you manchar region or khatav taluka of farmers contact
How to Evaluate UNL Deconversion


• UNL -> Hindi: reference translation needs to be generated from UNL
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
• Needs expertise
• Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue
Input Detail




• Marathi -> English -> UNL
• 901 sentences from the agriculture domain
• English: median length 14 words, std dev 7.5
• Hindi reference translation generated from English
Manual Evaluation Guidelines
Fluency of the given translation is:
(4) Perfect: Good grammar
(3) Fair: Easy-to-understand but flawed grammar
(2) Acceptable: Broken - understandable with effort
(1) Nonsense: Incomprehensible
Adequacy: How much meaning of the reference
sentence is conveyed in the translation:
(4) All: No loss of meaning
(3) Most: Most of the meaning is conveyed
(2) Some: Some of the meaning is conveyed
(1) None: Hardly any meaning is conveyed
Sample Translations
[Table: sample Hindi outputs from the evaluation set, each with its Fluency and Adequacy score]
Results
                       BLEU   Fluency  Adequacy
Geometric Average      0.34   2.54     2.84
Arithmetic Average     0.41   2.71     3.00
Standard Deviation     0.25   0.89     0.89
Pearson Cor. BLEU      1.00   0.59     0.50
Pearson Cor. Fluency   0.59   1.00     0.68

[Figure: Fluency vs Adequacy; bar chart of the number of sentences (0 to 200) at each Fluency score (1 to 4), with one series per Adequacy score (1 to 4)]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large scale evaluation using Fluency alone
Caution: Domain diversity, Speaker diversity
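For readers who want to reproduce summary statistics of this kind on their own ratings, here is a minimal sketch in plain Python. The ratings below are invented toy values, not data from this evaluation. Whether the slide reports a sample or a population standard deviation is not stated, so the sample form used here is an assumption.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean; every value must be strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = arithmetic_mean(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = arithmetic_mean(xs), arithmetic_mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Toy ratings on the 1-4 scales (hypothetical, for illustration only).
fluency = [1, 2, 3, 4, 4]
adequacy = [1, 3, 3, 4, 3]

print(arithmetic_mean(fluency), geometric_mean(fluency), sample_std(fluency))
print(pearson(fluency, adequacy))
```

The geometric mean never exceeds the arithmetic mean, which is consistent with the first two rows of the table. Note that a geometric mean over BLEU scores is undefined if any sentence scores exactly zero, so such sentences would need separate handling.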
Future Work


• Extend to Marathi and other languages
• Test in domains other than agriculture
Thanks !!