Transcript Document

Project Tukaram
Sagar Tamhane
Centre for Indian Language
Technology Solutions
IIT Bombay
12 June 2002
The Goal
• To make Saint Tukaram’s Abhangas available over the web for browsing and searching
• To locate the right Abhangas that the user needs
• To present the pages to the user in order of importance
The Source
• The Abhangas were typed from the book “श्री तुकारामबावांच्या अभंगांची गाथा” (Shri Tukarambavanchya Abhanganchi Gatha), published on 6th November 1973 by the Govt. of Maharashtra
• Previous editions: 1950 and 1955
• Number of Abhangas: 4644
Creation of Web Content
• Software used for typing: MS Word with the Akruti_Priya_Expanded font and the Akruti keyboard driver
• Problems faced:
– Non-displayable characters
  E.g. one character (its image is not preserved in this transcript) was typed as “mna”
• Automated page splitting
Converters Used
• Akruti_Priya_Expanded → ISCII converter: required for indexing the text
• ISCII → Monolingual ISFOC converter: required for displaying the text through DV-TTYogesh
• XDVNG → ISCII converter: converts query strings to ISCII
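A font-to-ISCII converter of this kind can be pictured as a table-driven, longest-match rewrite, which is also how a Lex scanner picks between competing rules. The sketch below is only an illustration in Python, not the project's Lex & C code: the function name `convert`, the table `TOY_TABLE`, and every byte value in it are assumptions made for the example and should be checked against the real Akruti and ISCII tables.

```python
# Hypothetical toy table: font key sequence -> output bytes.
# The byte values are illustrative placeholders, not a verified ISCII table.
TOY_TABLE = {
    "ga": b"\xb5",
    "gaa": b"\xb5\xda",
    "qa": b"\xc3",
    "qaa": b"\xc3\xda",
}


def convert(text, table):
    """Rewrite text left to right, always taking the longest table key that matches."""
    max_len = max(len(key) for key in table)
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out += table[chunk]
                i += size
                break
        else:
            # No key of any length matched at this offset.
            raise ValueError(f"no mapping for {text[i]!r} at offset {i}")
    return bytes(out)


if __name__ == "__main__":
    print(convert("gaaqaa", TOY_TABLE).hex())
```

Trying the longest key first matters because shorter keys are often prefixes of longer ones ("ga" and "gaa" here); a shortest-match scan would split such sequences incorrectly.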
Technologies used for the Tukaram Search Engine
• Input Technology:
– Jtrans: XDVNG font
• Keyboard Mapping:
– Phonetic English
• Result Display at client:
– ISFOC
• Encoding for indexing (storage):
– ISCII
Architecture
Input Technology
Components of the Search Engine
• Index
– Case sensitive ISCII
– Database structure
• Searcher
– In-memory search
– Algorithm: Hybrid of Hashing & Binary search
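The slides do not say how hashing and binary search are combined, so the following is one plausible reading rather than the project's actual Java design: hash on the first character of a word to pick a bucket, then binary-search the sorted words inside that bucket. The names `build_index` and `lookup`, the sample words, and the use of Python strings in place of ISCII byte strings are all assumptions made for this sketch.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(postings):
    """postings maps each word to the abhanga numbers that contain it.

    Returns {first_char: (sorted_words, posting_lists)} with both lists aligned.
    """
    grouped = defaultdict(list)
    for word in postings:
        if word:  # ignore empty strings
            grouped[word[0]].append(word)
    index = {}
    for first, words in grouped.items():
        words.sort()
        index[first] = (words, [sorted(postings[w]) for w in words])
    return index


def lookup(index, word):
    """Hash step: pick the bucket. Binary-search step: find the exact word in it."""
    if not word or word[0] not in index:
        return []
    words, lists = index[word[0]]
    i = bisect_left(words, word)
    if i < len(words) and words[i] == word:  # exact, case-sensitive match
        return lists[i]
    return []


if __name__ == "__main__":
    idx = build_index({"vitthal": [12, 3], "vari": [7], "pandurang": [3]})
    print(lookup(idx, "vitthal"))
```

Because the whole structure lives in memory and comparisons are exact, a lookup costs one dictionary access plus a binary search over a single bucket, and a query that differs only in case finds nothing.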
Database Structure
• Snapshot of result
Relevancy Criteria
• Number of query words in the abhang
• Position
• Adjacency
• Total number of words in the abhang
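The four criteria above could be folded into a single score along the following lines. This is a sketch under stated assumptions, not the project's ranking formula: the slides give neither weights nor exact definitions, so the function name `relevance`, the `WEIGHTS` values, and the way each criterion is normalised are hypothetical.

```python
# Hypothetical weights; the slides do not give any.
WEIGHTS = {"coverage": 3.0, "position": 1.0, "adjacency": 2.0, "density": 1.0}


def relevance(query_words, abhang_words):
    """Score one abhang (a list of words) against a query (a list of words)."""
    positions = {
        q: [i for i, w in enumerate(abhang_words) if w == q] for q in query_words
    }
    matched = [q for q in query_words if positions[q]]
    if not matched:
        return 0.0
    total = len(abhang_words)
    # 1. Number of query words found in the abhang, as a fraction of the query.
    coverage = len(matched) / len(query_words)
    # 2. Position: an earlier first match scores higher.
    first = min(positions[q][0] for q in matched)
    position = 1.0 - first / total
    # 3. Adjacency: consecutive query words that also appear side by side.
    pairs = list(zip(query_words, query_words[1:]))
    hits = sum(
        1 for a, b in pairs if any(p + 1 in positions[b] for p in positions[a])
    )
    adjacency = hits / len(pairs) if pairs else 0.0
    # 4. Total number of words: matches count for more in a shorter abhang.
    density = sum(len(positions[q]) for q in matched) / total
    return (
        WEIGHTS["coverage"] * coverage
        + WEIGHTS["position"] * position
        + WEIGHTS["adjacency"] * adjacency
        + WEIGHTS["density"] * density
    )


if __name__ == "__main__":
    query = ["pandurang", "vitthal"]
    short = ["pandurang", "vitthal", "namo"]
    long_ = ["namo", "deva", "pandurang", "hari", "vitthal", "namo", "namo"]
    print(relevance(query, short) > relevance(query, long_))
```

Sorting the matching abhangas by this score in descending order would give the "order of importance" named in the goal; the relative weights are the part most likely to need tuning.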
General information
• Number of abhangas: 4,644
• Total number of words: 2,09,702
• Number of distinct words: 34,773
• Languages used for converters: Lex & C
• Language used for search engine: Java 2
• Scripting on client side: JavaScript
Thank You