Marathi – Marathi Monolingual Information Retrieval

Mr. Ashish Almeida
Prof. Pushpak Bhattacharyya
Overview
• Morphological analyzer
• Suffix processing
• Stop-words
• Future work
Present work

• A search for “भारत” (bhaarat, Bharat) will not match pages that contain terms such as
– भारताचा – bhaarataachaa – of Bharat
– भारतात – bhaarataat – in Bharat
• Lack of a large corpus
• Unavailability of tools
Corpus Statistics – Marathi
• 99,275 documents (510 MB)
– Maharashtra Times
– Sakal News
• April 2004 to September 2007
• UTF-8 encoding
• XML tags
– DOC - document
– DOCNO – document identifier
– TEXT - article
Document: example
<DOC>
<DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO>
<TEXT>
मोहफूल वेचण्यास गेलेल्या तरुणावर बिबट्याचा हल्ला
(A leopard attacks a young man who had gone to collect Moha flowers)
इस्लापूर, ता. २२ - चारोळी आणि मोहफूल वेचण्यासाठी जंगलात गेलेल्या एका आदिवासी
तरुणावर बिबट्याने अचानक हल्ला केल्याने तो तरुण गंभीर जखमी झाला आहे. ही
घटना शुक्रवारी (ता. २०) मुळझरा (ता. किनवट) या गावाच्या जंगलात घडली. .......
इस्लापूर वन परिक्षेत्र कार्यालयाअंतर्गत येणाऱ्या मुळझरा येथील आदिवासी तरुण मनोहर
...
...
</TEXT>
</DOC>
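A minimal sketch of how records in this layout could be read, assuming a corpus file is simply a sequence of <DOC> blocks with no enclosing root element. The function name parse_docs and the shortened sample text are illustrative and not part of the original work; regular expressions are used instead of an XML parser so that a missing root element does not cause a failure.

```python
import re

# Patterns for the three tags described on the corpus slide.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def parse_docs(raw):
    """Return a list of (docno, text) pairs found in a string of <DOC> records."""
    docs = []
    for body in DOC_RE.findall(raw):
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        if docno and text:  # skip records missing either field
            docs.append((docno.group(1).strip(), text.group(1).strip()))
    return docs


# Shortened sample record (hypothetical content for illustration).
sample = """<DOC>
<DOCNO>MaharashtraC06E811C6B.htm.txt</DOCNO>
<TEXT>
मोहफूल वेचण्यास गेलेल्या तरुणावर
</TEXT>
</DOC>"""

docs = parse_docs(sample)
print(docs[0][0])  # MaharashtraC06E811C6B.htm.txt
```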
Topics
• 100 topics
• Aligned with English topics
• XML tags
– num : query identifier
– title: title of the query
– desc: description
– narr: Additional information about the query
• Cover all issues, both local and international
Topic example
<top>
<num>1</num>
<title>ट्वेंटी-२० विश्वचषकातील भारताचे क्रीडापटुत्व
(India’s championship in the Twenty20 World Cup)</title>
<desc>पहिल्या आयसीसी विश्व ट्वेंटी-२० सर्वोत्कृष्ट-विजेता-स्पर्धेतील
भारताच्या विजयाचे वृत्त देणारा लेख शोधा.</desc>
<narr>ट्वेंटी-२० विश्वचषक स्पर्धेमधील पाकिस्तान विरूद्ध भारताचा
विजय, ह्या ऐतिहासिक विजया निमित्त खेळाडूंनी केलेले विक्रम त्यांनी
मिळविलेली बक्षिसे व पुरस्काराची रक्कम सामनावीराचे तसेच
मालिकावीराचे नाव, माजी खेळाडूंनी आणि जगभरातील लोकांनी केलेली
प्रशंसा यासंदर्भात आम्ही उचित माहिती मिळवत आहोत.</narr>
</top>
Tools
• Terrier
– Open source IR system
– Models
• TF-IDF (Vector space model)
• DFR-BM25 (Probabilistic)
– Both models available in Terrier
• Evaluation against relevance-judged documents for 25 queries
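The evaluation measures reported in this deck (MAP, R-precision, precision at 5 and 10, recall) can be sketched with their standard textbook definitions. This is an illustration only, not Terrier's own evaluation tool; the function names and the document identifiers d1 to d6 are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def recall(ranked, relevant):
    """Fraction of the relevant documents that were retrieved at all."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# One hypothetical query: five retrieved documents, three judged relevant.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 3
```

Averaging average_precision over the 25 judged queries gives a MAP figure of the kind shown in the results.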
Lemmatizer vs. stemmer
• Example inflected forms
– भारताला – bhaarataalaa – for Bharat
– भारताचा – bhaarataachaa – of Bharat
– भारतात – bhaarataat – in Bharat
– भारतावर – bhaarataavar – on Bharat
• Lemmatizer finds the lemma
– भारत
• Stemmer finds the stem: the longest unchanging word prefix
– भारता
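The definition of a stem as the longest unchanging prefix can be checked directly on the four forms above. The sketch below compares strings one Unicode code point at a time, which is an assumption made for simplicity; a real stemmer works on a single word with suffix rules rather than on a set of known forms.

```python
def longest_common_prefix(words):
    """Return the longest prefix shared by every word (code-point comparison)."""
    if not words:
        return ""
    shortest = min(words, key=len)
    for i, ch in enumerate(shortest):
        if any(w[i] != ch for w in words):
            return shortest[:i]
    return shortest


forms = ["भारताला", "भारताचा", "भारतात", "भारतावर"]
stem = longest_common_prefix(forms)
print(stem)  # भारता
```

The result keeps the vowel sign ा of the inflected forms, which is why the stem differs from the lemma भारत.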
Marathi suffixes
• Suffixes include case markers, postposition
markers etc.
• Suffixes can stack: one suffix may attach after another
• Example:
– घरासमोरचादेखिल
– घरा-समोर-चा-देखिल
– gharaa-samor-chaa-dekhil
– house-front-of-also
– Root word: घर (ghar) (house)
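Stacked suffixes can be peeled off from the right end of a word one at a time. The following is a toy sketch that assumes a three-entry suffix list taken from the example; it is not the morphological analyzer used in this work, and the function name strip_suffixes is hypothetical.

```python
# Hypothetical toy suffix list covering only the example word.
SUFFIXES = ["देखिल", "समोर", "चा"]


def strip_suffixes(word, suffixes=SUFFIXES):
    """Peel known suffixes off the right end of a word.

    Returns (remaining stem, suffixes in their surface order).
    """
    found = []
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so a short suffix cannot pre-empt a long one.
        for s in sorted(suffixes, key=len, reverse=True):
            if word.endswith(s) and len(word) > len(s):
                word = word[: -len(s)]
                found.insert(0, s)
                changed = True
                break
    return word, found


stem, parts = strip_suffixes("घरासमोरचादेखिल")
print(stem, parts)  # घरा ['समोर', 'चा', 'देखिल']
```

The remaining stem is the oblique form घरा rather than the root घर; mapping one to the other is the kind of step that needs a morphological analyzer instead of plain string stripping.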
Morphological analyzer
• Uses a Marathi morphological analyzer
– Better matching of words
• राम versus रामा
• Gives all possible roots
– Selects the first root (the most frequent)
• Used at both the indexing and the query-processing ends
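Applying the same lemmatization at indexing time and at query time is what lets a query for भारत match documents containing only inflected forms. A minimal sketch, assuming the analyzer is replaced by a small hypothetical lookup table (ANALYSES) that lists candidate roots with the most frequent first; the documents d1 to d3 are invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the analyzer: surface form -> candidate roots,
# ordered with the most frequent root first.
ANALYSES = {
    "भारताचा": ["भारत"],
    "भारतात": ["भारत"],
    "भारत": ["भारत"],
}


def lemmatize(token):
    """Pick the first candidate root; fall back to the token itself if unknown."""
    roots = ANALYSES.get(token)
    return roots[0] if roots else token


def build_index(docs):
    """Build an inverted index from lemma to the set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemmatize(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing the lemma of any query token."""
    hits = set()
    for token in query.split():
        hits |= index.get(lemmatize(token), set())
    return hits


docs = {"d1": "भारताचा विजय", "d2": "भारतात पाऊस", "d3": "मुंबई शहर"}
index = build_index(docs)
print(sorted(search(index, "भारत")))  # ['d1', 'd2']
```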
Lemmatizer Results
Run                                        MAP     R-precision  Precision at 5  Precision at 10  Recall
TF-IDF without lemmatizer                  0.3366  0.2944       0.3167          0.2583           0.8724
TF-IDF + lemmatizer                        0.4003  0.3551       0.3417          0.2917           0.9686
DFR-BM25 without lemmatizer                0.3455  0.3209       0.3500          0.2667           0.8744
DFR-BM25 + lemmatizer                      0.4140  0.3686       0.3833          0.3083           0.9619
DFR-BM25 + lemmatizer (FIRE submission)    0.3625  0.3797       0.4600          0.3960           0.9178
Suffixes
• Usually ignored
• Indexing suffixes – not studied
• Index selected suffixes
– Suffixes of space and time
• वर – var – on
• समोर – samor – in front of
• मध्ये – madhye – in
• नंतर – nantar – after
• Created manually
– List of 66 words
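One way to picture suffix indexing: once a word has been analyzed into a root and its suffixes, the root is indexed as usual and any suffix on the selected list is indexed as an extra term. The sketch below assumes the analysis is already available and uses only the four suffixes shown above, not the full manual list; the function name index_terms is hypothetical.

```python
# Four of the manually selected space/time suffixes shown on the slide.
SPACE_TIME_SUFFIXES = {"वर", "समोर", "मध्ये", "नंतर"}


def index_terms(root, suffixes, selected=SPACE_TIME_SUFFIXES):
    """Return the terms to index for one analyzed word.

    The root is always kept; a suffix is kept only if it is on the selected list.
    """
    return [root] + [s for s in suffixes if s in selected]


terms = index_terms("घर", ["समोर", "चा", "देखिल"])
print(terms)  # ['घर', 'समोर']
```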
Stop-words
• Most frequently occurring words
• Little discriminatory value
• Occur in 80% or more documents
• Selected stop-words
– ती, ते, या, ूून, अस, आह, ये, हो, कर, त
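The 80% criterion can be expressed as a document-frequency threshold. The sketch below only produces candidate stop-words, on the assumption that the final list was then selected from such candidates; the tokenized documents are invented for illustration.

```python
from collections import Counter


def find_stop_words(docs, threshold=0.8):
    """Return the terms that occur in at least `threshold` of the documents.

    `docs` is a list of token lists; each term is counted once per document.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    n = len(docs)
    return {term for term, count in df.items() if count / n >= threshold}


# Five hypothetical tokenized documents.
docs = [
    ["ते", "या", "घर"],
    ["ते", "या"],
    ["ते", "या", "भारत"],
    ["ते", "या"],
    ["या", "शहर"],
]
candidates = find_stop_words(docs)
print(sorted(candidates))  # ['ते', 'या']
```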
Results: suffix indexing and stop-words
Run                                                         MAP     R-precision  Precision at 5  Precision at 10  Recall
DFR-BM25 + lemmatization + suffix indexing                  0.4381  0.3846       0.3917          0.3167           0.97085
DFR-BM25 + lemmatization + suffix indexing + stop-words     0.4433  0.3798       0.4000          0.3208           0.9731
P-R graph
• Precision-recall graph for all four cases is shown below
[Figure: precision (y-axis, 0 to 0.8) against recall % (x-axis, 0 to 100) for four runs: base-line; lemmatization; lemmatization and indexing suffixes; lemmatization, indexing suffixes and stopwords]
Future work
• Morphological analyzer
– Accuracy is 94.5%
• Needs to be improved
• Heuristic suffix stripping for unknown words
• Handling derivational morphology
• Spelling variations and common spelling mistakes
Acknowledgement
• “Cross Lingual Information Access” Project
• Maharashtra Times: Times Media Group
– http://in.indiatimes.com/aboutus.cms
• Sakal: Sakal Media Group
– http://www.sakaal.in/
References
• http://ir.dcs.gla.ac.uk/terrier/
• Ricardo Baeza-Yates and Berthier Ribeiro-Neto,
Modern Information Retrieval
• Jacques Savoy, Searching strategies for the
Bulgarian language
• Morphological Analyzer, CFILT
Thank you