Talk-3 - IIT Patna

Indo-Australia Workshop on Optimization in
Human Language Technology
16th Dec 2012, IIT Patna
Language Change as a Constrained Multi-Objective Optimization
Monojit Choudhury
Microsoft Research Lab, India
[email protected]
Language Change
• Change in the syntactic/semantic/phonological
features of a language
• Perpetual, universal, directional (?)
• Phonological Change:
– Affects the sounds
– Structured, independent of syntax/semantics
– Example: Loss of consonant clusters in Hindi
agni → aag, dugdha → dUdh, raatri → raat
Effects of the “Lazy Tongue”
Assimilation
• in+apt = inapt
• in+decent = indecent
• in+polite = impolite
• in+mature = immature
• in+legal = illegal
• in+regular = irregular
Deletion
• cannot → can’t
• do not → don’t
• will not → won’t
• are not → ain’t
• information → info
Explanations for Change
Exogenous causes
– Language contact
– Socio-political factors
– Communication medium
Endogenous causes
– Functional
– Phonetic error-based
– Frequency drifts
– Evolutionary
Functional Explanation of
Language Change
• There are three evolutionary forces on any
linguistic system:
– Minimization of effort (energy)
– Maximization of perceptual distinctiveness
(Minimization of ambiguity)
– Maximization of learnability
Language is a perpetually evolving system
shaped by these three conflicting forces
Outline of the Talk
• Morpho-phonological change of Bangla Verb
systems and emergence of dialect diversity
– Approach: Multi-Objective Constrained Optimization
– Technique: Multi-Objective Genetic Algorithm (MOGA)
• Understanding Computer Mediated Communication
– Normalization of Texting language
– Romanization of Indian Language text
Geography of Bangla
• Standard Colloquial Bengali (SCB)
• Agartala Colloquial Bengali (ACB)
• Sylhetti
History of Bangla
1200 AD
1800 AD
Bangla Verb Morphology
করেছিলাম
kar-echh-il-aam ("I had done")
– kar: verb root (do)
– echh: aspect (perfect)
– il: tense (past)
– aam: person (first)
Cognates in the Dialects
root: kar (to do)

Features      Classical      SCB         ACB
Non-finite    kariyA         kore        kairA
Ps,2, per.    kariyAChila    koreChilo   korsilo
Ps,1, cont.   kariteChilAm   korChilAm   kartAslAm
Atomic Phonological Operators
Deletion, Metathesis, Assimilation, Mutation
kariteChila
→ Del(e/t_Ch) → karitChila
→ Del(t/_Ch) → kariChila
→ Met(ri/_Ch) → kairChila
→ Asm(ao/_i) → korChila
→ Mut(a→o/_$) → korChilo
Hypothesis
A sequence of Atomic Phonological Operators is
preferred if the verb forms obtained by applying
this sequence to the classical forms have some
functional benefit over the classical forms.
Thus, all the modern dialects of Bangla have some
functional advantage over the classical dialect.
A Formal Model of Functional Explanation
[Figure: plot of f1 (effort of articulation) against f2 ([acoustic distinctiveness]^-1), with regions labelled Unstable languages, Metastable languages and Impossible languages]
Genetic Algorithm
Gene (A string of symbols)
What the solution actually looks like
GA: search for good solutions mimicking
nature [recombination and mutation of genes]
Phenotype
Lexicon consisting of 28 forms for the verb kar
kori      kori
korChi    kartAsi
:         :
korte     kartA
Genotype
A sequence of atomic phonological operators
Del t Met ri NOP Del e Asm a Del i NOP
Dsm e NOP NOP Met ri Asm a Del e NOP
Genotype → Phenotype
Del t Met ri NOP Del e Asm a Del i NOP
Classical forms:             kari   kariteChi   karite
After Del t:                 kari   karieChi    karie
After Met ri:                kair   kaireChi    kaire
After Del e, Asm a, Del i:   kor    korCh       kor
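The genotype-to-phenotype mapping above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every operator is applied context-free to every form, and "Asm a" is read as "a becomes o before i". Neither detail is stated on the slides, and the function names are hypothetical.

```python
def apply_operator(op, form):
    """Apply one atomic phonological operator to a single word form."""
    name, _, arg = op.partition(" ")
    if name == "NOP":                      # no operation
        return form
    if name == "Del":                      # delete every occurrence of the segment
        return form.replace(arg, "")
    if name == "Met":                      # metathesis: swap the two segments
        return form.replace(arg, arg[::-1])
    if name == "Asm" and arg == "a":       # assumed reading: a -> o before i
        return form.replace("ai", "oi")
    raise ValueError(f"unsupported operator: {op}")


def express(genotype, classical_forms):
    """Turn a genotype (operator sequence) into a phenotype (list of forms)."""
    forms = list(classical_forms)
    for op in genotype:
        forms = [apply_operator(op, form) for form in forms]
    return forms


genotype = ["Del t", "Met ri", "NOP", "Del e", "Asm a", "Del i", "NOP"]
classical = ["kari", "kariteChi", "karite"]
phenotype = express(genotype, classical)
print(phenotype)  # ['kor', 'korCh', 'kor']
```

Under these assumptions the intermediate results after "Del t" and "Met ri" also match the rows shown on the slide.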
Crossover
Mutation
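A minimal sketch of the two genetic operators on operator-sequence genotypes follows. The slides do not say which crossover or mutation scheme was used, so one-point crossover and per-gene random replacement are assumptions, and REPERTOIRE is a small hypothetical stand-in for the full operator set.

```python
import random

# Hypothetical stand-in for the operator repertoire.
REPERTOIRE = ["NOP", "Del t", "Del e", "Del i", "Met ri", "Asm a"]


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length genotypes at a random cut point."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(genotype, rate, rng):
    """Replace each gene by a random operator with probability `rate`."""
    return [rng.choice(REPERTOIRE) if rng.random() < rate else gene
            for gene in genotype]


rng = random.Random(0)
a = ["Del t", "Met ri", "NOP", "Del e", "Asm a", "Del i", "NOP"]
b = ["Dsm e", "NOP", "NOP", "Met ri", "Asm a", "Del e", "NOP"]
child_1, child_2 = one_point_crossover(a, b, rng)
mutant = mutate(child_1, rate=0.1, rng=rng)
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing models.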
Multi-Objective GA
Multi-Objective GA: Apply constraints
Multi-Objective GA: Finding out good solutions
Multi-Objective GA: But also keep some not-so-good solutions
Multi-Objective GA: After several iterations
Objective functions
• Articulatory effort
– fe(Λ): weighted sum of number of syllables,
letters and vowel height differences averaged
over all words in the lexicon
• Acoustic Distinctiveness
– fd(Λ): Inverse of mean edit distance between
words
• Learnability
– fr(Λ): correlation between feature match and
edit distance
Experiments
• NSGA-II: a package for fast MOGA
• Gene length: 15 APOs
• A repertoire of 128 APOs
• Population: 1000, Generations: 500
• 6 Models with different combinations of constraints and objectives
Pareto-optimal front
[Figure: Pareto-optimal front with the positions of SCB, ACB, Sylhetti and CB marked]
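What "Pareto-optimal" means can be shown with a small dominance filter. This sketch only illustrates the dominance test for minimized objectives. NSGA-II itself adds non-dominated sorting into ranks and crowding-distance selection, which are not shown here. The score pairs are made up.

```python
def dominates(p, q):
    """True if p is no worse than q on every objective and strictly better on one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (effort, inverse distinctiveness) scores for five hypothetical lexica.
scores = [(1.0, 5.0), (2.0, 3.0), (3.0, 3.5), (4.0, 1.0), (2.5, 2.9)]
front = pareto_front(scores)
print(front)  # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (2.5, 2.9)]
```

Here (3.0, 3.5) drops out because (2.0, 3.0) is better on both objectives.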
Observations
• vertical and horizontal limb
• real dialects on the horizontal limb
• Sound changes push the dialects from right
to left (reduce effort)
• but never up the limb
• why?
Role of Constraints
For more information
Choudhury et al., Evolution, optimization, and language
change: the case of Bengali verb inflections, in Proceedings
of ACL SIGMORPHON9, Association for Computational
Linguistics, 2007
http://research.microsoft.com/people/monojitc/
MOGA and NSGA II
Kanpur Genetic Algorithms Laboratory
http://www.iitk.ac.in/kangal/index.shtml
Food for Thought
• Evaluation:
– Myriads of possible dialects, but only a few
observed in nature
• Fixed set of pre-defined APOs – how to
generalize for any change?
• MOGA is an optimization tool, which in no
way simulates language change
– How do languages optimize themselves?
Outline of the Talk
• Morpho-phonological change of Bangla Verb
systems and emergence of dialect diversity
– Approach: Multi-Objective Constrained Optimization
– Technique: Multi-Objective Genetic Algorithm (MOGA)
• Understanding Computer Mediated Communication
– Normalization of Texting language
– Romanization of Indian Language text
Computer Mediated Communication
Form
Texting Language
• A new genre of English & also other languages
used in chats, sms, emails, blogs, tweets, FB
posts, comments etc.
dis is n eg 4 txtin lang
This is an example for Texting language
Texting Language
• A new genre of English & also other languages used in chats, sms, emails, blogs, etc.
• The shorter, the faster
• Constraint: understandability
• Ungrammatical, unconventional spellings
dis is n eg 4 txtin lang (24 characters)
This is an example for Texting language (39 characters)
Analysis of Social Media
• A hot topic in NLP
– Normalization
– Language identification
– Sentiment/Polarity detection
– Summarization/trend prediction
Choudhury et al. (2007) Investigation and Modeling
of the Structure of Texting Language. In IJCAI
Workshop on Analytics of Noisy Data 2007
Tomorrow never dies!!!
• 2moro (9)
• tomoz (25)
• tomoro (12)
• tomrw (5)
• tom (2)
• tomra (2)
• tomorrow (24)
• tomora (4)
• tomm (1)
• tomo (3)
• tomorow (3)
• 2mro (2)
• morrow (1)
• tomor (2)
• tmorro (1)
• moro (1)
Patterns or Compression Operators
• Phonetic substitution (phoneme)
– psycho → syco, then → den
• Phonetic substitution (syllable)
– today → 2day, see → c
• Deletion of vowels
– message → mssg, about → abt
• Deletion of repeated characters
– tomorrow → tomorow
• Truncation (deletion of tails)
– introduction → intro, evaluation → eval
• Common Abbreviations
– Bangalore → blr, text back → tb
• Informal pronunciation
– going to → gonna, better → betta
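Three of the compression operators listed above can be sketched as string functions. These are simplified illustrations, not the model from the talk. Keeping the first letter during vowel deletion is an assumption that happens to fit the slide examples, and the substitution tables are hypothetical.

```python
import re

# Hypothetical lookup tables, seeded from the slide examples.
WORD_SUBSTITUTIONS = {"see": "c", "for": "4"}
PREFIX_SUBSTITUTIONS = {"to": "2"}


def drop_repeats(word):
    """Deletion of repeated characters: collapse runs of the same letter."""
    return re.sub(r"(.)\1+", r"\1", word)


def drop_vowels(word):
    """Deletion of vowels, keeping the first letter (assumed convention)."""
    if not word:
        return word
    return word[0] + re.sub(r"[aeiou]", "", word[1:])


def substitute_syllable(word):
    """Phonetic substitution of a whole word or a leading syllable."""
    if word in WORD_SUBSTITUTIONS:
        return WORD_SUBSTITUTIONS[word]
    for prefix, symbol in PREFIX_SUBSTITUTIONS.items():
        if word.startswith(prefix) and len(word) > len(prefix):
            return symbol + word[len(prefix):]
    return word


print(drop_repeats("tomorrow"))      # tomorow
print(drop_vowels("message"))        # mssg
print(substitute_syllable("today"))  # 2day
```

Composing such operators in different orders yields many variants of the same word.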
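HMMs for SMS Normalization
[Figure: word HMM for "today", with states S0 and S6; grapheme states G1 to G5 labelled 'T', 'O', 'D', 'A', 'Y', each annotated with ε, its own letter and @; phoneme states P2 /AH/ and P4 /AY/; and state S1 labelled "2"]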
Bigram Examples
• TL: would b gd 2 c u some time soon
• Op: would be good to see you some time soon
• TL: just wanted 2 say a big thanx 4 my bday card
• Op: just wanted to say a big thanks for my today
card
• TL: me wel i fink bein at home makes me feel a
lot more stressed den bein away from it
• Op: me well i think being at home makes me feel
a lot more stressed deny being away from it
Use of Indian Languages on
Online Social Media
Transliteration
Spelling
Change
Code mixing
Indian English
Concluding Remarks
• Languages are perpetually evolving and
optimizing systems
– Computational modeling of language change is
still in its infancy
– Lots of scope for research
Thank You!
[email protected]
Questions??
Why Computational Models?
FOR
– Exploration
– Virtual experimentation
– Formalization
AGAINST
– Toy languages
– Simplified assumptions
– Intractable
Can we model real world language change?
Objectives and Constraints - 1
• Articulatory effort
$f_e(w) = \alpha_1 f_{e1}(w) + \alpha_2 f_{e2}(w) + \alpha_3 f_{e3}(w)$
$f_{e1}(w) = |w|$
$f_{e2}(w) = \sum_i hr(\sigma_i)$
$f_{e3}(w) = \sum_i |ht(V_i) - ht(V_{i+1})|$
Objectives and Constraints - 2
• Acoustic distinctiveness
$f_d(\Lambda) = \frac{1}{N} \sum ed(w_i, w_j)^{-1}$
$C_d(\Lambda) = -1$ if $ed(w_i, w_j) = 0$ for more than 2 pairs
• Phonotactic constraints
Cp(Λ) = -1 if any of the words violate
the phonotactic constraints of the language
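The distinctiveness objective and its constraint can be sketched as follows, using unit-cost Levenshtein distance for ed. Several details are assumptions, since the slide does not spell them out: identical pairs are skipped when averaging, N is taken as the number of pairs averaged, and a feasible lexicon returns 0 from the constraint.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def f_d(lexicon):
    """Mean inverse edit distance over word pairs (identical pairs skipped)."""
    distances = [edit_distance(u, v) for u, v in combinations(lexicon, 2)]
    inverses = [1 / d for d in distances if d > 0]
    return sum(inverses) / len(inverses) if inverses else 0.0


def c_d(lexicon, max_identical_pairs=2):
    """-1 if more than `max_identical_pairs` word pairs are identical, else 0."""
    identical = sum(1 for u, v in combinations(lexicon, 2)
                    if edit_distance(u, v) == 0)
    return -1 if identical > max_identical_pairs else 0


lexicon = ["kori", "korChi", "korte"]
print(f_d(lexicon), c_d(lexicon))
```

For this three-word lexicon the pairwise distances are 2, 2 and 3, so f_d is (1/2 + 1/2 + 1/3) / 3.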
Objectives and Constraints - 3
• Learnability as Regularity
– fr: The correlation coefficient between the edit
distance and number of matching
morphological attributes for every word pair
– Cr = -1 if fr > 0.8
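The regularity measure, as the slide states it, can be sketched as a Pearson correlation over all word pairs. The attribute tuples and word forms below are made-up toy data, not Bangla forms. Returning 0 for a feasible lexicon is an assumption.

```python
from itertools import combinations
from math import sqrt


def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def f_r(lexicon):
    """Correlate edit distance with the number of matching attributes per pair."""
    distances, matches = [], []
    for (fa, wa), (fb, wb) in combinations(lexicon.items(), 2):
        distances.append(edit_distance(wa, wb))
        matches.append(sum(x == y for x, y in zip(fa, fb)))
    return pearson(distances, matches)


def c_r(lexicon, threshold=0.8):
    """-1 if f_r exceeds the threshold, else 0 (0 is an assumed feasible value)."""
    return -1 if f_r(lexicon) > threshold else 0


# Toy lexicon: (tense, person) -> made-up form.
toy = {("past", "1"): "kalam", ("past", "2"): "kala",
       ("pres", "1"): "kim", ("pres", "2"): "ki"}
print(f_r(toy), c_r(toy))
```

In this toy lexicon, pairs sharing more attributes tend to be closer in form, so the correlation comes out negative.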
Emergent dialects
Classical      D1        D2                  D3
kariteChilAm   kartA     karChi (korChi)     karteChi (kartAsi)
kariteChila    kartAa    karCha (korCha)     karteCha (kartAsa)
kariteChilen   kartAen   karChen (korChen)   karteChen (kartAsen)