Multi-Lingual Wordnets: Coimbatore Workshop (11-14 June, 2009) at


amritesharyai namah: Obeisance to Amma
Multi-Lingual Wordnets:
Coimbatore Workshop
(11-14 June, 2009)
at
Amrita University
Pushpak Bhattacharyya
Computer Science and Engineering Department
Indian Institute of Technology
Bombay
Objective of the wordnet
workshop
• PAN-Indian Wordnets
• Involving languages from the North East, the Western
part, the Northern part and the Southern part of India
– Sanskrit
– Assamese, Bodo, Nepali, Manipuri
– Hindi, Kashmiri
– Marathi, Konkani
– Tamil, Telugu, Kannada, Malayalam
– English
• Meeting minds of those who LOVE WORDS
AND THEIR RELATIONSHIPS
Ambiguity
The Crux of the problem
Stages of language processing
• Phonetics and phonology
• Morphology
• Lexical Analysis
• Syntactic Analysis
• Semantic Analysis
• Pragmatics
• Discourse
Phonetics
• Processing of speech
• Challenges
– Homophones: bank (finance) vs. bank (river bank)
– Near Homophones: maatraa vs. maatra (hin)
– Word Boundary
• aajaayenge: aa jaayenge (will come) or aaj aayenge (will come today)
• I got [ua]plate
– Phrase boundary
• mtech1 students are especially exhorted to attend as such seminars
are integral to one's post-graduate education
– Disfluency: ah, um, ahem etc.
Morphology
• Word formation rules from root words
• Nouns: Plural (boy-boys); Gender marking (czar-czarina)
• Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had
sat); Modality (e.g. request khaanaa khaaiie)
• Crucial first step in NLP
• Languages rich in morphology: e.g., Dravidian, Hungarian,
Turkish
• Languages poor in morphology: Chinese, English
• Languages with rich morphology have the advantage of easier
processing at higher stages of processing
• A task of interest to computer science: Finite State Machines for
Word Morphology
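As a rough illustration of the finite-state idea, the sketch below hand-codes a tiny machine for regular English noun plurals. The state names, the rule set, and the function names `step` and `pluralize` are assumptions made for this example; they do not come from the workshop material, and irregular nouns are not handled.

```python
# States that mean "the last character read was a consonant".
CONSONANT_STATES = {"CONS", "SAW_C", "SAW_S", "SIBILANT", "CONS_Y"}


def step(state, ch):
    """Transition function: the state records only what the plural rule needs."""
    if ch == "s":
        return "SAW_S"
    if ch in "xz":
        return "SIBILANT"
    if ch == "c":
        return "SAW_C"
    if ch == "h":
        # "ch" and "sh" behave like sibilants; any other "h" is a plain consonant.
        return "SIBILANT" if state in ("SAW_C", "SAW_S") else "CONS"
    if ch == "y":
        # "y" after a consonant triggers the -ies rule; otherwise treat it as a vowel.
        return "CONS_Y" if state in CONSONANT_STATES else "VOWEL"
    if ch in "aeiou":
        return "VOWEL"
    return "CONS"


def pluralize(noun):
    """Run the machine over a lowercase noun and emit the suffix chosen by the final state."""
    if not (noun.isalpha() and noun.islower()):
        raise ValueError("expected a non-empty lowercase alphabetic noun")
    state = "START"
    for ch in noun:
        state = step(state, ch)
    if state in ("SAW_S", "SIBILANT"):
        return noun + "es"
    if state == "CONS_Y":
        return noun[:-1] + "ies"
    return noun + "s"
```

The machine keeps no memory beyond its current state, which is what makes morphology of this kind tractable with finite-state methods. A realistic analyzer would need a lexicon of exceptions on top of these rules.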
Lexical Analysis
• Essentially refers to dictionary access and
obtaining the properties of the word
e.g. dog
noun (lexical property)
take-’s’-in-plural (morph property)
animate (semantic property)
4-legged (semantic property)
carnivore (semantic property)
Challenge: Lexical or word sense disambiguation
Lexical Disambiguation
First step: Part of Speech Disambiguation
• Dog as a noun (animal)
• Dog as a verb (to pursue)
Sense Disambiguation
• Dog (as animal)
• Dog (as a very detestable person)
Needs word relationships in a context
• The chair emphasised the need for adult education
Very common in day to day communications
Satellite Channel Ad: Watch what you want, when you want
(two senses of watch)
e.g., Ground breaking ceremony/research
Technological developments bring in new
terms, additional meanings/nuances for
existing terms
– Justify as in justify the right margin (word
processing context)
– Xeroxed: a new verb
– Digital Trace: a new expression
– Communifaking: pretending to talk on mobile
when you are actually not
– Discomgooglation: anxiety/discomfort at not
being able to access internet
– Helicopter Parenting: over parenting
Syntax Processing Stage
Structure Detection
[S [NP I] [VP [V like] [NP mangoes]]]
Parsing Strategy
• Driven by grammar
• S -> NP VP
• NP -> N | PRON
• VP -> V NP | V PP
• N -> Mangoes
• PRON -> I
• V -> like
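The grammar above is small enough to run directly. The following is a minimal sketch of a top-down, backtracking parser for it. The names `GRAMMAR`, `parse`, `expand` and `parse_sentence` are assumptions for illustration, the terminal is written in lowercase as "mangoes" to match the example sentence, and PP has no rule in the slide, so the alternative VP -> V PP can never succeed here.

```python
# Toy grammar from the slide; each nonterminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["PRON"]],
    "VP": [["V", "NP"], ["V", "PP"]],  # PP has no rule, so this branch never matches
    "N": [["mangoes"]],
    "PRON": [["I"]],
    "V": [["like"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can derive tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # A terminal (or an undefined nonterminal such as PP): match one token literally.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in expand(rhs, tokens, pos):
            yield (symbol, *children), end


def expand(rhs, tokens, pos):
    """Yield (list_of_subtrees, next_pos) for a sequence of right-hand-side symbols."""
    if not rhs:
        yield [], pos
        return
    for tree, mid in parse(rhs[0], tokens, pos):
        for rest, end in expand(rhs[1:], tokens, mid):
            yield [tree] + rest, end


def parse_sentence(sentence):
    """Return every complete parse tree of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]
```

Calling `parse_sentence("I like mangoes")` returns a single nested tuple with the same shape as the tree shown on the structure-detection slide. An ambiguous sentence under a richer grammar would return more than one tree, which is exactly the structural ambiguity problem.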
Challenges in Syntactic
Processing: Structural Ambiguity
• Scope
1. The old men and women were taken to safe locations
(old men and women) vs. ((old men) and women)
2. No smoking areas will allow Hookas inside
• Preposition Phrase Attachment
• I saw the boy with a telescope
(who has the telescope?)
• I saw the mountain with a telescope
(world knowledge: mountain cannot be an instrument of
seeing)
• I saw the boy with the pony-tail
(world knowledge: pony-tail cannot be an instrument of
seeing)
Ubiquitous: newspaper headline “20 years later, BMC
pays father 20 lakhs for causing son’s death”
Structural Ambiguity…
• Overheard
– I did not know my PDA had a phone for 3 months
• An actual sentence in the newspaper
– The camera man shot the man with the gun when he was near
Tendulkar
• (P.G. Wodehouse, Ring in Jeeves) Jill had rubbed
ointment on Mike the Irish Terrier, taken a look at the
goldfish belonging to the cook, which had caused anxiety
in the kitchen by refusing its ant’s eggs…
• (Times of India, 26/2/08) Aid for kins of cops killed in
terrorist attacks
Headache for Parsing: Garden
Path sentences
• Garden Pathing
– The horse raced past the garden fell.
– The old man the boat.
– Twin Bomb Strike in Baghdad kill 25 (Times of
India 05/09/07)
Semantic Analysis
• Representation in terms of
• Predicate calculus/Semantic
Nets/Frames/Conceptual Dependencies and
Scripts
• John gave a book to Mary
• Give action: Agent: John, Object: Book,
Recipient: Mary
• Challenge: ambiguity in semantic role labeling
– (Eng) Visiting aunts can be a nuisance
– (Hin) aapko mujhe mithaai khilaanii padegii
(ambiguous in Marathi and Bengali too; not in
Dravidian languages)
Pragmatics
• Very hard problem
• Model user intention
– Tourist (in a hurry, checking out of the hotel,
motioning to the service boy): Boy, go upstairs
and see if my sandals are under the divan. Do not
be late. I just have 15 minutes to catch the train.
– Boy (running upstairs and coming back panting):
yes sir, they are there.
• World knowledge
– WHY INDIA NEEDS A SECOND OCTOBER (ToI,
2/10/07)
Discourse
Processing of sequence of sentences
Mother to John:
John go to school. It is open today. Should you
bunk? Father will be very angry.
Ambiguity of open
bunk what?
Why will the father be angry?
Complex chain of reasoning and application of
world knowledge
Ambiguity of father
father as parent
or
father as headmaster
Complexity of Connected Text
John was returning from school dejected
– today was the math test
He couldn’t control the class
Teacher shouldn’t have made him
responsible
After all he is just a janitor
Lexical Knowledge Structures
Indian Scenario
Linked Wordnets
Bengali
Wordnet
Dravidian
Language
Wordnets
Sanskrit
Wordnet
Punjabi
Wordnet
Hindi
Wordnet
North East
Language
Wordnet
Konkani
Wordnet
Marathi
Wordnet
English
Wordnet
Great Linguistic Diversity
• Major streams
– Indo European
– Dravidian
– Sino Tibetan
– Austro-Asiatic
• Some languages are ranked within the top 20 in the world in terms of
the populations speaking them
– Hindi and Urdu: 5th (~500 million)
– Bangla: 7th (~300 million)
– Marathi: 14th (~70 million)
Major Language Processing Initiatives
• Mostly from the Government: Ministry of IT,
Ministry of Human Resource Development,
Department of Science and Technology
• Recently great drive from the industry: NLP
efforts with Indian language in focus
– Google
– Microsoft
– IBM Research Lab
– Yahoo
– TCS
Technology Development in Indian
Languages (TDIL)
• Started by the Ministry of IT in 2000
• 13 resource centers across the country
• Responsibility for two languages: one major and one
minor
• For example,
– IIT Bombay: Marathi and Konkani
– IIT Kanpur: Hindi and Nepali
– ISI Kolkata: Bangla and Santhaali
– Anna University: Tamil
Achievements in TDIL: Lexical
Resources
• Wordnets: Hindi and Marathi (IIT Bombay)
• Ontologies: Tamil concept hierarchy (Tanjavur University,
AU-KBC)
• Semantically rich lexicons: IIT Kanpur, IIITH, IIT Bombay
• Corpora: Central Institute of Indian Languages (CIIL)
• Web Content: All 13 centers; Gujarati content is exhaustive
and of good quality
Recent Initiatives
• NLP Association of India: 2 years old:
recently efforts are on making tools and
resources freely available on the website of
NLPAI
• LDC-IL (like the Linguistic Data Consortium at
UPenn)
– Approved by the planning commission
• National Knowledge Commission: special
drive on translation (human and machine)
Recent Initiatives (contd.)
• Consortia set up already for IL-IL MT, E-IL MT and
CLIA
• SAALP: South Asian Association for Language
Processing (formed with SAARC countries)
Industry Scenario: English
• How to use NLP to increase the search engine
performance (precision, recall, speed)
• Google, Rediff, Yahoo, IRL, Microsoft: all have search
engine, IR, IE R & D projects outsourced from USA and
being carried out in India.
Industry Scenario: Indian Language
• English-Hindi MT is regarded as critical
• IBM Research lab has massive English Hindi Parallel
Corpora (news domain)
– Statistical Machine Translation
• Microsoft India at Bangalore has opened a
Multilingual Computing Division
• Google and Yahoo India are actively pursuing IL
search engines
Related work
• Eurowordnet (Vossen, 1999) and BalkaNet
(Christodoulakis, 2002)
– where synsets of multiple languages are
linked among themselves and to the Princeton
Wordnet (Miller et al., 1990; Fellbaum, 1998)
– through Inter-Lingual Indices (ILI)
Our experience: Multilingual
Wordnets for Indian
Languages
Wordnet work at IIT Bombay
• http://www.cfilt.iitb.ac.in
• Follow the design principle(s) of the Princeton
Wordnet for English paying particular attention to
language specific phenomena (such as complex
predicates)
• Hindi Wordnet
– Total Number of Synsets: >30,000
– Total Number of Unique Words: >65,000
• Marathi Wordnet
– Total Number of Synsets: >18,000
– Total Number of Unique Words: >30,000
HWN and MWN created using different
principles
(Tatsam, i.e., Sanskrit words borrowed as such: very often)
HWN entry:
{peR, vriksh, paadap, drum, taru, viTap, ruuksh, ruukh, adhrip,
taruvar} ‘tree’
jaR,tanaa, shaakhaa, tathaa pattiyo se yukt bahuvarshiya vanaspati
‘perennial woody plant having root, stem, branches and leaves’
peR manushya ke lie bahut hi upayogii hai ‘trees are useful to men’
MWN entry:
{jhaaR, vriksh, taruvar, drum, taruu, paadap} ‘tree’
mule, khoR, phaanghaa, pane ityaadiinii yokt asaa vanaspativishesh
‘perennial woody plant having root, stem, branches and leaves’
tii damuun jhaadacyaa saavlit baslii ‘Being tired/exhausted she sat
under the shadow of the tree’
Hindi WN: recently made free
A glimpse of the wordnet
[Figure: Marathi Wordnet entry for the synset {झाड, वृक्ष, तरू} ‘tree’ with its relations]
HYPERNYMY: वनस्पती
HYPONYMY: लिंबू, आंबा
MERONYMY: खोड, मूळ
HOLONYMY: रान, बाग
GLOSS: मुळे, खोड, फांद्या, पाने इत्यादींनी युक्त असा वनस्पतिविशेष: "झाडे पर्यावरण शुद्ध करण्याचे काम करतात"
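A wordnet entry of this kind is essentially a node (the synset) with labelled edges to other synsets. The sketch below is one possible in-memory representation, not the actual storage format of the Hindi or Marathi Wordnet. The class `Synset`, the English keys such as "plant" and "forest", and the assignment of words to relations are assumptions read off the figure as far as it can be recovered.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    words: list                                     # synonymous words of the synset
    gloss: str = ""                                 # definition text
    relations: dict = field(default_factory=dict)   # relation name -> list of synset ids


# Hypothetical lexicon keyed by English labels chosen only for this example.
LEXICON = {
    "tree": Synset(
        words=["jhaaR", "vriksh", "taruvar", "drum", "taruu", "paadap"],
        gloss="perennial woody plant having root, stem, branches and leaves",
        relations={
            "hypernymy": ["plant"],
            "hyponymy": ["mango"],
            "meronymy": ["trunk", "root"],
            "holonymy": ["forest"],
        },
    ),
    "plant": Synset(words=["वनस्पती"]),
    "mango": Synset(words=["आंबा"]),
    "trunk": Synset(words=["खोड"]),
    "root": Synset(words=["मूळ"]),
    "forest": Synset(words=["रान"]),
}


def related_words(lexicon, synset_id, relation):
    """Collect the words of every synset reachable from `synset_id` via `relation`."""
    targets = lexicon[synset_id].relations.get(relation, [])
    return [word for target in targets for word in lexicon[target].words]
```

For example, `related_words(LEXICON, "tree", "meronymy")` returns the words for the parts of a tree, while a synset with no outgoing edge for a relation returns an empty list.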
Marathi WN created from Hindi:
expansion approach: issues
• For a concept, words exist in both Hindi and Marathi:
most common
• For a concept, words exist in Hindi but not in Marathi
– {दादा [daadaa, grandfather], बाबा [baabaa, grandfather], आजा
[aajaa, grandfather], दद्दा [daddaa, grandfather], पितामह
[pitaamaha, grandfather], प्रपिता [prapitaa, grandfather]} are
words in Hindi for paternal grandfather. There are no equivalents
in Marathi.
• For a concept, words exist in Marathi and not in
Hindi
– {गुढीपाडवा [gudhipaadvaa, new year], वर्षप्रतिपदा [varshpratipadaa,
new year]} are words in Marathi which do not have any
equivalents in Hindi.
Analogy with English
• {mama}: uncle from mother’s side
• {chacha}: uncle from father’s side
• No natural words in English
• Introduce multiwords
– {mama, maternal uncle} and {chacha, paternal uncle}
• Makes the lexical resource look unnatural to a
native speaker
• Pitfall of expansion approach?
• WN users tend to look upon and use the lexical
resource as an ordinary dictionary.
Other concerns
• Identical word
– Faux Amis: “false friends” or “false cognates”
• samaadhaan- solution (Hindi), satisfaction (Marathi)
• shikshaa- education (Hindi), punishment (Marathi)
– Narrowing of meaning
– Widening of meaning
• Identical Meaning
– Richness of vocabulary in Hindi and not in Marathi
and vice versa (like the words for snow in Eskimo
language)
Narrowing and Widening of
meaning
[Figure: the same word in Hindi and in Marathi, with the two senses related by hypernymy/hyponymy in either direction]
Dictionary standardization
Large Scale Nation Wide Projects
in Consortia Mode
• English to Indian Language Machine
Translation
• Indian Language to Indian Language
Machine Translation
• Cross Lingual Information Access
– Each of about 800 Crores of Rupees,
equivalent to about 200 million dollars
– Each with participation by 10 different institutes
across the length and breadth of the country
Adopted Standard
Senses | Hindi | Marathi | Bengali | Oriya | Tamil
(W1, W2, W3, W4, W5, W6) | (W1, W2, W3, W4, W5, W6) | (W1, W2, W3) | (W1, W2, W3) | (W1, W2, W3, W4) | (W1, W2, W3)
(sun) | (सूर्य, सूरज, भानु, भास्कर, प्रभाकर, दिनकर, अंशुमान, अंशुमाली) | (सूर्य, भानु, दिवाकर, भास्कर, रवि, दिनेश, दिनमणी) | ... | ... | ...
(cub, lad, laddie, sonny, sonny boy) | (लड़का, बालक, बच्चा, छोकड़ा, छोरा, छोकरा, लौंडा) | (मुलगा, पोरगा, पोर, पोरगे) | … | … | …
(son, boy) | (पुत्र, बेटा, लड़का, लाल, सुत, बच्चा, नंदन, पूत, चिरंजीव, चिरंजी) | (मुलगा, पुत्र, लेक, चिरंजीव, तनय) | … | … | …
Advantages of the concept based
multilingual dictionary (1/2)
• Economy of labor and storage
– Semantic features like [±Animate, ±Human,
±Masculine, etc.] assigned to a nominal
concept and not to any individual lexical item
of any language
– Semantic features, such as [+Stative (e.g.,
know), +Activity (e.g., stroll),
+Accomplishment (e.g., say), +Semelfactive
(e.g., knock), +Achievement (e.g., win)] are
assigned to a verbal concept.
Advantages of the concept based
multilingual dictionary (2/2)
• Bilingual pairwise dictionaries can be generated
automatically.
• The model admits of the possibility of extracting
a domain specific dictionary for all or any
specific language pair.
• The language group which lacks competence in
the pivot language (which in our case is Hindi) can benefit
from the already worked out languages.
– E.g. Tamil and Malayalam
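The claim that pairwise dictionaries fall out of the concept-based model can be made concrete with a small sketch. It assumes the multilingual dictionary is held as a mapping from a concept id to one synset per language. The concept ids, the language keys, and the function name `bilingual_dictionary` are hypothetical, and the word lists are shortened versions of the synsets quoted in these slides.

```python
# Hypothetical concept-indexed store: concept id -> {language: synset words}.
CONCEPTS = {
    "tree": {
        "hindi": ["peR", "vriksh", "paadap"],
        "marathi": ["jhaaR", "vriksh", "taruvar"],
    },
    "no-doubt": {
        "hindi": ["nishshank", "aashankahiin"],
        "marathi": ["nihshanka", "shankaarahita"],
    },
    # A concept lexicalized in only one language is simply skipped for other pairs.
    "new-year": {"marathi": ["gudhipaadvaa", "varshpratipadaa"]},
}


def bilingual_dictionary(concepts, source, target):
    """Map every source-language word to the target words sharing a concept with it."""
    entries = {}
    for synsets in concepts.values():
        if source not in synsets or target not in synsets:
            continue
        for word in synsets[source]:
            bucket = entries.setdefault(word, [])
            for candidate in synsets[target]:
                if candidate not in bucket:
                    bucket.append(candidate)
    return entries
```

With n languages in the table, the same function yields all n(n-1) directed dictionaries without any pair-specific lexicography. Restricting `concepts` to a domain-tagged subset would give the domain-specific dictionaries mentioned above.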
Word alignment in the dictionary
model
• Even if we choose the right sense of a
word in the source language (SW1), there
is still the hurdle of choosing the
appropriate target language word.
• Lexical choice is a function of complex
parameters like situational aptness and
native speaker acceptability.
Example
• Concept: ‘the state of having no doubt of
something’
– Hindi: {nishshank, anaashankita,
aashankahiin, befikr, bekhtak, sangshayhiin}
– Marathi: {nihshanka, nirdhaasta, nirbhrot,
shankaarahita}
• Third member in the Hindi synset
aashankahiin is appropriately mapped to
the fourth member in the Marathi synset
shankaarahita and not to the first one.
Links set up between words
Marathi synset: मुलगा/HW1, पोरगा/HW6, पोर/HW2, पोरगे/HW6
Hindi synset: लड़का/HW1, बालक/HW2, बच्चा/HW3, छोकड़ा/HW4, छोरा/HW5, छोकरा/HW6, लौंडा/HW7
English synset: male child/HW1, boy/HW2
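One way to read the /HWn tags is as pointers from each word to the n-th word of the Hindi pivot synset. Under that reading (an assumption, since the slide does not define the notation), word-level alignment between any two languages can be resolved through the pivot, as in the sketch below. The names `HINDI`, `MARATHI_LINKS`, `ENGLISH_LINKS`, `to_pivot` and `align`, and the fallback to the whole target synset, are choices made for this example.

```python
# Hindi pivot synset, indexed by the HW ids used in the figure.
HINDI = {
    "HW1": "लड़का", "HW2": "बालक", "HW3": "बच्चा", "HW4": "छोकड़ा",
    "HW5": "छोरा", "HW6": "छोकरा", "HW7": "लौंडा",
}

# Each non-pivot word points at the Hindi word it is aligned with.
MARATHI_LINKS = {"मुलगा": "HW1", "पोरगा": "HW6", "पोर": "HW2", "पोरगे": "HW6"}
ENGLISH_LINKS = {"male child": "HW1", "boy": "HW2"}


def to_pivot(word, links):
    """Return the Hindi pivot word that `word` is linked to."""
    return HINDI[links[word]]


def align(word, source_links, target_links):
    """Return target words linked to the same pivot word as `word`.

    If no target word shares the pivot link, fall back to the whole
    target synset (an assumed policy, not one stated in the slides).
    """
    pivot_id = source_links[word]
    exact = [target for target, tid in target_links.items() if tid == pivot_id]
    return exact or list(target_links)
```

Here `align` maps a Marathi word linked to HW1 onto the English word linked to HW1, while a Marathi word linked to HW6 has no English counterpart at word level and receives the full English synset as candidates.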
Linguistic challenges (1/2)
• Using a synthetic expression
– ‘ornaments and other gifts given to the bride by the bridegroom
on the day of wedding’ chadhaava (Hindi)
– विवाहसमयी वराकडून वधूला दिले जाणारे दागिने ‘at-the-time-of-wedding–bridegroom–bride–given–ornament’ (Marathi)
• Using transliteration, if the synthetic expression is larger
– seharaa (~garland: complicated cultural expression)
– Seharaa (transliterated in Marathi)
• Reciprocally, maahervaashiin ‘a woman who has come
to stay at her parents' place after her marriage’ (Marathi): no
equivalent in Hindi
Linguistic challenges (2/2)
• Singleton Hindi pivot synset expressed
through more than one finer concept in Marathi
• fikaa in Hindi: ‘food prepared with less sugar,
salt or spice’,
• Marathi equivalent: three distinct words
expressing three distinct finer concepts
– agodh ‘less sweet’
– aLanii ‘less salty’
– miLamiLat ‘less spicy’.
• These three words cannot be taken as the
members of a single synset in Marathi
Computational Aspects
Dictionary development framework
Dictionary entry template
• ID :: 02691516
• CAT :: verb
• CONCEPT :: be in a state of movement or action
• EXAMPLE :: "The room abounded with screaming children"
• SYNSET-ENGLISH :: (abound, burst, bristle)
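The template above is a flat list of `FIELD :: value` pairs, so reading it into a record is straightforward. The sketch below assumes one field per line and a parenthesized, comma-separated synset; the function name `parse_entry` and the handling of the `SYNSET-` prefix are assumptions, since the actual file format of the framework is not specified in the slides.

```python
ENTRY = """\
ID :: 02691516
CAT :: verb
CONCEPT :: be in a state of movement or action
EXAMPLE :: "The room abounded with screaming children"
SYNSET-ENGLISH :: (abound, burst, bristle)
"""


def parse_entry(text):
    """Parse 'FIELD :: value' lines into a dict; SYNSET-* fields become word lists."""
    entry = {}
    for line in text.splitlines():
        if "::" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split("::", 1))
        if key.startswith("SYNSET-"):
            value = [word.strip() for word in value.strip("()").split(",")]
        elif key == "EXAMPLE":
            value = value.strip('"')
        entry[key] = value
    return entry
```

A further field such as `SYNSET-HINDI` or `SYNSET-MARATHI` (hypothetical names following the same pattern) would be parsed the same way, which keeps all languages of one concept under a single ID.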
Language and Task
Configuration window
Synset entry and word-alignment
interface
Conclusion (1/3)
• Linked wordnets: Immense Lexical Resource
• Great benefits to machine translation, cross
lingual search
• Very useful for language teaching, pedagogy,
comparative linguistics
• Akin to Eurowordnet, but critical differences due
to typical Indian language characteristics
• Great Unifier of the country
Conclusion (2/3)
• Computational challenges:
– Maintenance of multilingual data
– their insertion, deletion and updating in a spatially and
temporally distributed situation
Conclusion (3/3)
• Advantages of the framework
– a linguistically sound basis of the dictionary
framework
– economy of representation and
– avoidance of duplication of effort