Marathi Search Engine Center for Indian Language

The Marathi Portal with a
Search Engine
Center for Indian Language
Technology Solutions, IIT Bombay
A Search Engine
To promote use of information available
on web in Marathi language
Locate the right pages that you need
Present the pages to the user in
order of importance
Types of Searches
Based on user queries
Category based search
Browse through pre-classified
categories
Search selected literature which will be
hosted on the Marathi Portal
Search Engine: Performance
Criteria
Coverage
Cover as many pages as possible. A study has
revealed that a large part of the web remains unindexed
Response time
The user should be presented with the results as
quickly as possible
Relevance
The information presented should be relevant and
listed in order of importance
Main Components of a Search
Engine
Crawling unit
Indexing unit
Searching unit
Ranking unit
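The indexing and searching units listed above can be illustrated with a minimal sketch of an inverted index. This is not the prototype's code: the function names build_index and search, the whitespace tokenisation, and the transliterated sample pages are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(pages):
    """Indexing unit (sketch): map each word to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.split():  # assumption: words are separated by whitespace
            index[word].add(page_id)
    return index

def search(index, query, mode="or"):
    """Searching unit (sketch): OR returns pages with any query word, AND with all."""
    postings = [index.get(word, set()) for word in query.split()]
    if not postings:
        return set()
    if mode == "and":
        return set.intersection(*postings)
    return set.union(*postings)

# Hypothetical sample pages (transliterated), standing in for crawled pages.
pages = {1: "tukaram abhang", 2: "marathi abhang", 3: "marathi batmi"}
index = build_index(pages)
or_hits = search(index, "tukaram marathi")
and_hits = search(index, "marathi abhang", mode="and")
```

The crawling unit would supply the pages dictionary and the ranking unit would order the returned page ids; both are left out of this sketch.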
A Prototype
A prototype has been developed to
gauge the complexity and architectural
issues involved in developing the
complete Marathi Portal
About the Prototype
A search engine prototype has been
built with manually selected sites in
different categories
It indexes about 1800 pages consisting
of over 10,14,000 words
The engine is developed on the Windows
platform using MS Access
Monolingual ISFOC pages are covered
Ranking Criteria used in the
prototype
Number of words in the query string that
appear in the document
In OR search, documents containing the maximum
number of words in the string are ranked higher
Proximity between words
Number of query words that occur within a distance of 5
words of each other
Context of the word
Is it in title or body?
Frequency of the desired word in the
document
No. of occurrences of the word
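The four criteria above can be sketched as a scoring function. This is a hedged illustration, not the prototype's implementation: the slides do not say how the criteria are combined, so comparing them as a tuple in the order listed is an assumption, as are the function name score, the whitespace tokenisation, and the transliterated sample documents.

```python
def score(query, title, body, window=5):
    """Return (matched words, close pairs, title hits, frequency) for one document."""
    q_terms = set(query.lower().split())
    title_toks = title.lower().split()
    body_toks = body.lower().split()

    # 1. Number of distinct query words that appear in the document.
    matched = sum(1 for t in q_terms if t in title_toks or t in body_toks)

    # 2. Proximity: pairs of different query words within `window` words in the body.
    positions = [(i, t) for i, t in enumerate(body_toks) if t in q_terms]
    close_pairs = 0
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            i, ti = positions[a]
            j, tj = positions[b]
            if j - i > window:
                break  # positions are in order, so later ones are even further away
            if ti != tj:
                close_pairs += 1

    # 3. Context: query words that occur in the title.
    title_hits = sum(1 for t in title_toks if t in q_terms)

    # 4. Frequency: total occurrences of query words in the body.
    freq = sum(1 for t in body_toks if t in q_terms)

    return (matched, close_pairs, title_hits, freq)

# Hypothetical documents; tuples compare element by element (an assumed ordering).
docs = [
    {"title": "abhang", "body": "abhang abhang abhang"},
    {"title": "marathi sahitya",
     "body": "tukaram abhang marathi sahitya madhe prasiddha aahe"},
]
query = "tukaram abhang"
ranked = sorted(docs, key=lambda d: score(query, d["title"], d["body"]), reverse=True)
```

With this ordering, a document that matches both query words ranks above one that repeats a single word many times, which is consistent with the OR-search rule stated above.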
A Fast Engine is under
Development
A Linux based fast prototype for the
same number of pages is being
developed.
It takes 2 minutes to build the
dictionary, 2 hours to build the index
and less than a second to search
What if the Machine that hosts
the engine fails?
The index must be in main memory while
search is being performed
You cannot afford to lose the index since it
would take days (even months for large
engines) to rebuild it over a large number
of pages
Dumping the index of the Linux prototype
through traversal takes around 35 minutes
But loading it back into main memory takes only 2
minutes!
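The dump-and-reload idea above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not describe the on-disk format of the Linux prototype, so Python's pickle module, the helper names dump_index and load_index, and the file name index.bin are all hypothetical choices.

```python
import os
import pickle
import tempfile

def dump_index(index, path):
    """Write the in-memory index to disk, replacing the old snapshot only when done."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Rename last, so a crash during the dump leaves the previous snapshot intact.
    os.replace(tmp_path, path)

def load_index(path):
    """Load a previously dumped index back into main memory."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical tiny index: word -> list of page ids.
index = {"aahe": [1, 2, 5], "aani": [2, 3]}
with tempfile.TemporaryDirectory() as workdir:
    snapshot = os.path.join(workdir, "index.bin")
    dump_index(index, snapshot)
    restored = load_index(snapshot)
    leftover_tmp = os.path.exists(snapshot + ".tmp")
```

Dumping periodically means that after a machine failure the engine reloads the last snapshot instead of rebuilding the index from the crawled pages.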
Requirements from the
Infrastructure for the actual
Portal
High RAM – in GBs
High Computing Power: Parallel Processing
through network of workstations
Parallel IO
As the number of users increases, more and more
parallelism will have to be employed to
guarantee the same performance criteria to each
user
Representations and Fonts
Currently only ISFOC is supported
There are sites in Marathi with different
types of encodings which need to be
integrated
Converters
Input/Display technology for Linux
Crawling
Crawling and meta-crawling techniques
Some interesting facts:
E.g. it was found that the word ‘Aahe’ is one of
the most widely occurring words
The words ‘Aahe’ and ‘Aani’ together span most
of the documents
There are specific words that occur most
widely and most frequently in different
categories
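Observations like the ones above can be obtained by counting, for each word, how many documents contain it. The sketch below assumes whitespace-separated words and uses made-up transliterated documents; the names document_frequency and coverage are hypothetical and the numbers are not the prototype's data.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for every word, the number of documents in which it appears."""
    df = Counter()
    for text in docs:
        df.update(set(text.split()))  # set(): count each word once per document
    return df

def coverage(docs, words):
    """Fraction of documents that contain at least one of the given words."""
    wanted = set(words)
    if not docs:
        return 0.0
    hits = sum(1 for text in docs if wanted & set(text.split()))
    return hits / len(docs)

# Hypothetical sample documents (transliterated).
docs = ["he pustak aahe", "ram aani shyam", "navin batmi", "te ghar aahe"]
df = document_frequency(docs)
most_widespread = df.most_common(1)[0][0]
share = coverage(docs, ["aahe", "aani"])
```

Words with very high coverage carry little information for ranking, while words that are frequent only within one category can help with category based search.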
Indexing and Searching
Incremental
Dynamic
Fast Search
In Memory
Relevancy
What the user really wants
Heuristics for ranking results
Query modification
Selected Texts
Saint Tukarama’s Abhangs will be made
searchable and will be hosted on this
website
Search on other selected texts will also
be hosted on this website