Transcript Computing Resources - Indian Statistical Institute
DCU meets MET: Bengali and Hindi Morpheme Extraction
Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones
CNGL, School of Computing, Dublin City University, Ireland
Outline
Motivation
Task Description
Bengali Stemming Approach
Hindi Stemming Approach
Results
Conclusions and Future Work
Motivation
Some languages have complex inflectional and derivational morphology, i.e. the same base form can correspond to multiple surface word forms
Example: company, companies → company; hopeful → hope
For information retrieval, indexing surface forms would lead to many mismatches between query terms and index terms extracted from documents
Index base forms/stems: reduce different surface forms to the same index form (stem, lemma) to increase the chance of matching query terms with document terms
Task Description
Morpheme Extraction Task: investigate the effect of morphological analysis, lemmatization, and stemming on information retrieval (IR) performance for Indian languages
Subtasks:
Subtask 1: manual evaluation of morpheme extraction
Subtask 2: IR evaluation using the proposed morpheme representation as index terms; the evaluation metric is mean average precision (MAP)
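The slides name MAP as the Subtask 2 metric but do not say which evaluation tool was used. As a rough illustration, here is a minimal Python sketch of the standard (uninterpolated) definition; the function names and the dictionary layout for runs and relevance judgments are assumptions made for this example.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list for one query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document is found.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-query average precision values.

    runs:  query id -> ranked list of document ids (hypothetical layout)
    qrels: query id -> collection of relevant document ids
    """
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
    qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}
    print(round(mean_average_precision(runs, qrels), 4))
```

For the toy data, q1 scores (1/1 + 2/3) / 2 and q2 scores 1/2, so the mean is about 0.6667.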
Stemming Approaches
Light vs. aggressive stemming
Rule-based vs. corpus-based stemming (manually created rules vs. clusters of related words)
Iteratively remove word suffixes
Problems:
overstemming, i.e. the removed suffix is too long, e.g. international/intern; news/new
understemming, i.e. the removed suffix is too short, e.g. forgetfulness/forgetful
irregular forms, e.g. feet/foot; women/woman
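The three problem cases can be reproduced with a deliberately naive suffix stripper. The sketch below is only a toy: the suffix list and the minimum stem length are hypothetical choices made to trigger the slide's English examples, not rules from any real stemmer.

```python
# Hypothetical toy rule set, chosen only to reproduce the examples above.
SUFFIXES = ("ational", "ness", "s")


def naive_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ("international", "news", "forgetfulness", "feet"):
        print(word, "->", naive_stem(word))
```

With these rules, "international" and "news" are overstemmed to "intern" and "new", "forgetfulness" is understemmed to "forgetful", and the irregular form "feet" is left unchanged and so never matches "foot".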
Our Bengali Stemming Approach
Rule-based stemmer created by a native speaker
Focus on nouns (most important for IR)
Four categories [Bhattacharya et al. 2005]:
Title markers: added as suffixes to proper nouns, e.g. “দেবী” (Mrs.), “বাবু” (sir)
Classifier: for plurality and specificity/gender of a noun, e.g. ছবিগুলো (pictures), ছবিটা (the picture), ছাত্রী (female student)
Case marker: for possessive or accusative relations, e.g. পরিবারের (family's)
Emphasizer: to emphasize the current word, e.g. ছবিই (only a picture), ছবিটাই (only this picture)
Bengali Stemmer
Drop emphasizers (iteratively), e.g. আবিক্যই → আবিক্য
Drop classifiers and case markers, e.g. মন্ত্রীরাও → মন্ত্রী, ভারতের → ভারত
Drop title markers, e.g. মমতাদেবী → মমতা
Drop plural suffixes, e.g. ভারতীয়দের → ভারতীয়
Drop derivational suffixes, e.g. বিেীশী → বিেী
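The ordered rule cascade can be sketched in a few lines of Python. This is not the DCU stemmer, which was written in C and whose full rule set is not given in the slides. The suffix lists below are small hypothetical samples taken from the slide examples, the minimum stem length is an assumption, and the plural and derivational steps are left out.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that suffix comparison is codepoint-stable."""
    return unicodedata.normalize("NFC", text)


# Hypothetical sample suffixes drawn from the slide examples only.
EMPHASIZERS = tuple(nfc(s) for s in ("ই", "ও"))
CLASSIFIERS_AND_CASE_MARKERS = tuple(nfc(s) for s in ("গুলো", "টা", "রা", "ের"))
TITLE_MARKERS = tuple(nfc(s) for s in ("দেবী", "বাবু"))


def strip_one(word, suffixes, min_stem=2):
    """Remove the longest matching suffix, keeping at least min_stem codepoints."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_bengali(word):
    """Apply the first three steps of the cascade in the order given above."""
    word = nfc(word)
    # Step 1: drop emphasizers iteratively.
    while True:
        shorter = strip_one(word, EMPHASIZERS)
        if shorter == word:
            break
        word = shorter
    # Step 2: drop one classifier or case marker.
    word = strip_one(word, CLASSIFIERS_AND_CASE_MARKERS)
    # Step 3: drop one title marker.
    return strip_one(word, TITLE_MARKERS)


if __name__ == "__main__":
    for word in ("ছবিটাই", "মন্ত্রীরাও", "ভারতের", "মমতাদেবী"):
        print(word, "->", stem_bengali(word))
```

The order matters: in ছবিটাই the emphasizer ই has to be removed before the classifier টা becomes word-final and can be stripped.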
Our Hindi Stemming Approach
Hindi has less complex inflectional morphology → fewer stemming rules
Rule-based stemmer
Stemming rules manually created by a native Hindi speaker
Hindi Stemmer
Iteratively remove Hindi vowels, matras, anusvara, and “य” (the character ya) from the right of a string until the first consonant is encountered
Drop derivational suffixes
e.g. लड़कों (to boys), लड़कियों (to girls), लड़का (boy), लड़की (girl)
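The first rule, stripping from the right until a consonant is reached, can be sketched as follows. This is a Python illustration rather than the original C code. The exact character classes are assumptions: independent vowels are taken as U+0905 to U+0914 and matras as U+093E to U+094C. The `min_len` guard and the function names are also hypothetical, and the derivational-suffix step is not covered.

```python
import unicodedata

ANUSVARA = "\u0902"  # the anusvara sign
YA = "\u092f"        # the letter ya


def is_strippable(ch):
    """True for independent vowels, matras, anusvara and ya (assumed ranges)."""
    cp = ord(ch)
    return (
        0x0905 <= cp <= 0x0914      # independent vowels
        or 0x093E <= cp <= 0x094C   # dependent vowel signs (matras)
        or ch == ANUSVARA
        or ch == YA
    )


def strip_hindi_suffix(word, min_len=1):
    """Drop strippable characters from the right until another character is hit."""
    word = unicodedata.normalize("NFC", word)
    end = len(word)
    # Never shrink the word below min_len codepoints (an assumption).
    while end > min_len and is_strippable(word[end - 1]):
        end -= 1
    return word[:end]


if __name__ == "__main__":
    for word in ("लड़कों", "लड़कियों", "लड़का", "लड़की"):
        print(word, "->", strip_hindi_suffix(word))
```

Under these assumptions all four example forms reduce to the same string, लड़क, so a query containing any one of them would match documents containing the others.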
MET Experiments
Experiments for Bengali and Hindi
Stemmers implemented in C
Submission as source code
Stemmed forms are used for retrieval with Terrier
Results
Team        Language  MAP     Change vs. baseline
Baseline    Bengali   0.2740
JU          Bengali   0.3307  (+20.69%)
DCU         Bengali   0.3300  (+20.44%)
IIT-KGP     Bengali   0.3225  (+17.70%)
CVPR-Team   Bengali   0.3159  (+15.29%)
ISM         Bengali   0.3103  (+13.25%)
Baseline    Hindi     0.2821
DCU         Hindi     0.2963  (+5.03%)
ISM         Hindi     0.2793  (-0.99%)
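The percentages are relative changes in MAP over the unstemmed baseline for the same language. A short Python check using the MAP values reported above is given here; the helper name is an assumption for this example.

```python
def relative_change(score, baseline):
    """Relative change of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0


# MAP values as reported in the results above.
BASELINES = {"Bengali": 0.2740, "Hindi": 0.2821}
RUNS = [
    ("JU", "Bengali", 0.3307),
    ("DCU", "Bengali", 0.3300),
    ("IIT-KGP", "Bengali", 0.3225),
    ("CVPR-Team", "Bengali", 0.3159),
    ("ISM", "Bengali", 0.3103),
    ("DCU", "Hindi", 0.2963),
    ("ISM", "Hindi", 0.2793),
]

if __name__ == "__main__":
    for team, language, score in RUNS:
        change = relative_change(score, BASELINES[language])
        print(f"{team:<10} {language:<8} {score:.4f} ({change:+.2f}%)")
```

For instance, the DCU Bengali run gives (0.3300 - 0.2740) / 0.2740 ≈ +20.44%.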
Conclusions
Bengali stemmer: 2nd best performance
Hindi stemmer: best performance
Both have also been used successfully in previous ad-hoc IR experiments for FIRE
Future work
Explore use of exclusion lists for irregular cases
Extend rule set (i.e. handle verbs)
Compare to other stemmers for Bengali/Hindi, e.g. Indian language in version 4 of Lucene; stemmers from Jacques Savoy's web page on cross-language IR
Investigate morphology of named entities
Thanks for your attention
Any questions?