Transcript Document
Project Tukaram
Sagar Tamhane
Centre for Indian Language Technology Solutions, IIT Bombay
12 June 2002

The Goal
• To make Saint Tukaram’s Abhangas available over the web for browsing and searching
• Locate the right Abhangas that you need
• Present the pages to the user in order of importance

The Source
• The Abhangas are typed from a book called “श्री तुकारामबावांच्या अभंगांची गाथा”, published on 6th November 1973 by the Govt. of Maharashtra
• Previous editions: 1950 and 1955
• Number of Abhangas: 4644

Creation of Web Content
• Software used for typing: MS Word with Akruti_Priya_Expanded font and Akruti keyboard driver
• Problems faced:
– Non-displayable characters. Eg: [character image] was typed as mna
• Automated page splitting

Converters Used
• Akruti_Priya_Expanded to ISCII converter: required for indexing the text
• ISCII to Monolingual ISFOC converter: required for displaying the text through DV-TTYogesh
• XDVNG to ISCII: for converting query strings to ISCII

Technologies used for the Tukaram Search Engine
• Input Technology:
– Jtrans: XDVNG font
• Keyboard Mapping:
– Phonetic English
• Result Display at client:
– ISFOC
• Encoding for indexing (storage):
– ISCII

Architecture (figure)

Input Technology (figure)

Components of the Search Engine
• Index
– Case sensitive ISCII
– Database structure
• Searcher
– In-memory search
– Algorithm: Hybrid of Hashing & Binary search

Database Structure (figure)
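The slides name the search algorithm only as a "hybrid of hashing and binary search" and give no further detail. One plausible reading, offered as a minimal sketch and not as the project's actual implementation: hash each word to a bucket, keep every bucket sorted, and binary-search inside the bucket. The class name HybridIndexSketch, the bucket count, and the use of Java strings in place of raw ISCII bytes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory index: hash a word to a bucket, then binary-search the sorted bucket. */
class HybridIndexSketch {
    private final String[][] bucketWords;    // sorted words in each bucket
    private final int[][][] bucketPostings;  // abhang numbers, parallel to bucketWords

    HybridIndexSketch(Map<String, int[]> wordToAbhangs, int bucketCount) {
        // Collect each bucket in a TreeMap so its words come out in sorted order.
        List<TreeMap<String, int[]>> buckets = new ArrayList<>();
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new TreeMap<>());
        }
        for (Map.Entry<String, int[]> e : wordToAbhangs.entrySet()) {
            buckets.get(bucketOf(e.getKey(), bucketCount)).put(e.getKey(), e.getValue());
        }
        // Freeze the buckets into plain arrays for compact in-memory search.
        bucketWords = new String[bucketCount][];
        bucketPostings = new int[bucketCount][][];
        for (int i = 0; i < bucketCount; i++) {
            bucketWords[i] = buckets.get(i).keySet().toArray(new String[0]);
            bucketPostings[i] = buckets.get(i).values().toArray(new int[0][]);
        }
    }

    // Hashing step (assumed): String.hashCode reduced to a bucket number.
    private static int bucketOf(String word, int bucketCount) {
        return Math.floorMod(word.hashCode(), bucketCount);
    }

    /** Returns the abhang numbers containing the word, or an empty array. Matching is case sensitive. */
    int[] lookup(String word) {
        int b = bucketOf(word, bucketWords.length);
        int pos = Arrays.binarySearch(bucketWords[b], word); // binary-search step
        return pos >= 0 ? bucketPostings[b][pos] : new int[0];
    }

    public static void main(String[] args) {
        Map<String, int[]> words = new HashMap<>();
        words.put("mna", new int[] {12, 340});   // made-up abhang numbers
        words.put("Mna", new int[] {77});
        HybridIndexSketch index = new HybridIndexSketch(words, 8);
        System.out.println(Arrays.toString(index.lookup("mna")));
        System.out.println(Arrays.toString(index.lookup("absent")));
    }
}
```

With roughly 35,000 distinct words, a few hundred buckets would leave each binary search only a handful of comparisons; the bucket count here is arbitrary.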

Snapshot of result (figure)

Relevancy Criteria
• Number of query words in the abhang
• Position
• Adjacency
• Total number of words in the abhang

General information
• Number of abhangas: 4,644
• Total number of words: 2,09,702
• Number of distinct words: 34,773
• Languages used for converters: Lex & C
• Language used for search engine: Java 2
• Scripting on client side: JavaScript

Thank You
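The four relevancy criteria listed on the Relevancy Criteria slide (number of query words in the abhang, position, adjacency, total number of words) are not given a formula in the slides. The sketch below shows one way they could be folded into a single score. The class name RelevancySketch, the weights, and the way each criterion is measured are assumptions chosen for illustration, not the project's ranking function.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical scoring of one abhang against a query, using the four listed criteria. */
class RelevancySketch {
    static double score(List<String> abhangWords, List<String> queryWords) {
        Set<String> query = new HashSet<>(queryWords);
        Set<String> matched = new HashSet<>();
        int firstHit = -1;      // position of the earliest query word
        int adjacentPairs = 0;  // query words that directly follow one another
        for (int i = 0; i < abhangWords.size(); i++) {
            if (!query.contains(abhangWords.get(i))) {
                continue;
            }
            matched.add(abhangWords.get(i));
            if (firstHit < 0) {
                firstHit = i;
            }
            if (i + 1 < abhangWords.size() && query.contains(abhangWords.get(i + 1))) {
                adjacentPairs++;
            }
        }
        if (matched.isEmpty()) {
            return 0.0;
        }
        int total = abhangWords.size();
        double position = 1.0 - (double) firstHit / total; // earlier hits score higher
        double brevity = 1.0 / total;                      // shorter abhangas score higher
        // Assumed weights: matches count most, then adjacency, then position and length.
        return 4.0 * matched.size() + 2.0 * adjacentPairs + position + brevity;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("a", "b");
        System.out.println(score(Arrays.asList("a", "b", "c", "d"), query));
        System.out.println(score(Arrays.asList("c", "a", "d", "b"), query));
    }
}
```

Results would then be shown in descending score order, so an abhang that has the query words next to each other near its start ranks above one where they are scattered.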