DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT

MAYURI UMRANIKAR

Introduction
Retrieval Environment
  - The Vector Space Model
  - INEX Environment
  - Flexible Retrieval System
Method Used for Retrieval
  - Document Tree Construction
  - Ranking of Elements
  - Output
Experiments
Conclusions

INTRODUCTION

- Extensible Markup Language (XML) is the preferred format for representing documents; as the number of XML documents grows, the problem of element retrieval arises.
- The focus is on retrieving relevant elements rather than entire documents.
- INEX: INitiative for the Evaluation of XML Retrieval
- Flexible mechanisms
- Different approaches
- Term weighting

RETRIEVAL ENVIRONMENT

- Two factors: the issues that arise when the focus moves from documents to components, and Salton's Vector Space Model.
- Vector Space Model: a term's weight is based on the number of times the term occurs in the document.
- Fox's Extended Vector Space Model: incorporates objective identifiers alongside content terms.
- A document vector consists of subvectors.
- Each subvector contains text that can be independently indexed, weighted, searched and retrieved.
- Term weighting: weighting is applied within the subjective subvectors.
- SMART experimental retrieval system

INEX ENVIRONMENT

- Content Only (CO) queries ignore document structure; like typical queries, they specify only the content of the search.
- Content and Structure (CAS) queries refer explicitly to document structure (exhaustive and specific).
- A CO query result goes directly to the user; a CAS query needs additional filtering and a search of the body portion.
- A CAS query returns a rank-ordered list of elements.
- INEX-EVAL uses measures of recall and precision (fig.); exhaustivity and specificity are mapped to a single relevance value, and results are ranked.

FLEXIBLE RETRIEVAL SYSTEM

- SMART format: documents and topics are translated and indexed as extended vectors.
- Subjective vectors contain content-bearing terms.
- Objective vectors serve as filters on the results returned by CAS queries.
- Extended vector: within the subjective vector, the terms of a paragraph are held in the body subvector.
- Lnu-ltu weighting
- Dynamic flexible retrieval: a tree representation of each document, from which a rank-ordered list of elements is produced using Lnu weights.

METHOD FOR FLEXIBLE RETRIEVAL

- Input: given a query Q and the paragraph index, retrieve a rank-ordered list of paragraphs (terminal nodes).
- The n top-ranked paragraphs are selected as input.
- This set of paragraphs is used to identify the documents of interest; their elements are generated and returned as output.
- Document tree: requires information about the structure of the document.
- Terminal nodes are identified by a pre-order traversal.
- Terminal nodes are found in the paragraph index.

SIMPLE XML DOCUMENT AND ITS SCHEMA

CONSTRUCTION OF DOCUMENT TREE

- For query Q, the n top-ranked paragraphs are used to build the trees.
- Leaf elements (terminal nodes) are paragraph nodes.
- Each leaf is represented by a term-frequency weighted vector.
- Step 1: gather all leaf nodes; this completes the terminal nodes.
- Step 2: merge the children's vectors to form each parent's vector.
- The document schema determines which nodes are merged.
- A parent's vector holds the unique terms of its children, term-frequency weighted, so it carries the content of all its children.
- The process is applied recursively until the tree is complete.
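The bottom-up merge described above can be sketched in a few lines. This is a minimal illustration, not the system's actual code: the `Node` class, its field names, and the sample terms are assumptions made for the example, and the schema-driven choice of which nodes to merge is reduced here to "merge all children".

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of a document tree (hypothetical structure for this sketch)."""
    name: str
    children: list = field(default_factory=list)
    tf: Counter = field(default_factory=Counter)  # term-frequency vector


def build_vectors(node):
    """Fill in term-frequency vectors bottom-up; return this node's vector."""
    if not node.children:
        # Leaf (paragraph): its vector is assumed to come from the paragraph index.
        return node.tf
    merged = Counter()
    for child in node.children:
        # Recurse first, then add the child's frequencies into the parent.
        merged.update(build_vectors(child))
    node.tf = merged
    return merged


# Two paragraphs under one section under one article (made-up terms).
p1 = Node("p", tf=Counter({"xml": 2, "retrieval": 1}))
p2 = Node("p", tf=Counter({"xml": 1, "element": 3}))
sec = Node("sec", children=[p1, p2])
article = Node("article", children=[sec])
build_vectors(article)
```

After the call, `sec.tf` and `article.tf` each hold the unique terms of the two paragraphs with their summed frequencies, which is the raw input to the term weighting step.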

RANKING OF ELEMENTS

- The set of elements of each document tree has now been generated.
- Problem: structured retrieval requires a rank-ordered list of elements.
- Reference method: the all-element index (a separate representation of each element of each document, with its weighting information).
- Lnu weights: suited to elements of variable length; they do not require global frequency information.
- Normalization and length: failing to normalize for length results in biased values.
- Pivot: the document length at which probability of relevance equals probability of retrieval.
- Slope: the amount of tilting.
- Pivoted normalization reduces the difference between the two probabilities.
- Lnu term weight:

\[
\text{Lnu} = \frac{\dfrac{1+\log(\text{term\_freq})}{1+\log(\text{avg\_term\_freq})}}{(1-\text{slope})+\text{slope}\cdot\dfrac{\text{no\_unique\_terms}}{\text{pivot}}}
\]
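As a rough illustration of the Lnu weight, the sketch below reads the denominator as (1 - slope) + slope * (no_unique_terms / pivot). The function names and the slope and pivot values in the usage lines are assumptions chosen for the example, not values from the experiments.

```python
import math


def lnu_weight(term_freq, avg_term_freq, no_unique_terms, slope, pivot):
    """Lnu weight of one term in one element (pivoted unique normalization)."""
    numerator = (1 + math.log(term_freq)) / (1 + math.log(avg_term_freq))
    denominator = (1 - slope) + slope * (no_unique_terms / pivot)
    return numerator / denominator


def lnu_vector(tf, slope, pivot):
    """Weight every term of a term-frequency dict using only local statistics."""
    avg_tf = sum(tf.values()) / len(tf)
    return {
        term: lnu_weight(freq, avg_tf, len(tf), slope, pivot)
        for term, freq in tf.items()
    }


# Illustrative values only: slope and pivot would be tuned per collection.
weights = lnu_vector({"xml": 3, "element": 3, "retrieval": 1}, slope=0.2, pivot=100)
```

Everything the function needs (term frequency, average term frequency, number of unique terms) is available inside the element itself, which is why Lnu weights can be computed for elements generated at query time.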

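- Ltu weighting: N is the collection size and nk is the number of elements containing term k.

\[
\text{ltu} = \frac{\bigl(1+\log(\text{term\_freq})\bigr)\cdot\log(N/n_k)}{(1-\text{slope})+\text{slope}\cdot\dfrac{\text{no\_unique\_terms}}{\text{pivot}}}
\]

- N and nk depend on the element type and would normally be known through indexing.
- As we move up the tree, N is obtained by counting the elements of each type.
- nk is obtained from the inverted file entry in the paragraph index, together with the (given) mapping between identifiers and XPaths.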
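A small sketch of the ltu query weight follows. It uses the standard ltu form, in which the inverse-document-frequency factor log(N/nk) multiplies the term-frequency factor; the function name, parameter names, and the numbers in the usage line are assumptions for illustration.

```python
import math


def ltu_weight(term_freq, n_elements, n_containing, no_unique_terms, slope, pivot):
    """ltu weight of one query term.

    n_elements   -- N, the number of elements of the type being ranked
    n_containing -- nk, the number of those elements containing the term
    """
    numerator = (1 + math.log(term_freq)) * math.log(n_elements / n_containing)
    denominator = (1 - slope) + slope * (no_unique_terms / pivot)
    return numerator / denominator


# Illustrative numbers only: a term occurring once in a 5-term query,
# found in 10 of 100 elements.
w = ltu_weight(1, 100, 10, 5, slope=0.2, pivot=5)
```

Unlike Lnu, this weight needs N and nk, which is why those counts have to be recovered for each element type when only paragraphs are indexed.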

OUTPUT OF FLEXIBLE RETRIEVAL

- For each further leaf node: gather its siblings, construct the document tree, calculate Lnu term weights for its elements, and correlate them with the ltu-weighted query to produce another rank-ordered list.
- Once the n top-ranked paragraphs are exhausted and the last list has been produced, merge the lists.
- The result is a single set of elements, rank-ordered by correlation with Q.
- Comparison: flexible retrieval and the all-element index give identical results when the set of n paragraphs input to flexible retrieval contains all the paragraphs needed and the same values are used for Lnu-ltu weighting.
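The ranking and merging steps can be illustrated as follows. This is a sketch under assumptions: correlation is taken to be the inner product of an element's weighted vector with the weighted query vector, and the element identifiers, weights, and function names are made up for the example.

```python
import heapq


def inner_product(element_vec, query_vec):
    """Correlation of an element with the query over their shared terms."""
    return sum(w * query_vec[t] for t, w in element_vec.items() if t in query_vec)


def rank_tree(elements, query_vec):
    """Rank the elements of one document tree, best first.

    elements maps an element identifier to its weighted term vector.
    """
    scored = [(inner_product(vec, query_vec), eid) for eid, vec in elements.items()]
    return sorted(scored, reverse=True)


def merge_rankings(rankings):
    """Merge per-tree ranked lists into one list, best first."""
    return list(heapq.merge(*rankings, key=lambda pair: pair[0], reverse=True))


# Hypothetical weighted vectors for two document trees and one query.
query = {"xml": 1.0, "retrieval": 2.0}
tree_a = {
    "a/sec[1]": {"xml": 0.5, "retrieval": 0.5},
    "a/sec[1]/p[1]": {"xml": 1.0},
}
tree_b = {"b/sec[1]/p[2]": {"retrieval": 1.0}}

merged = merge_rankings([rank_tree(tree_a, query), rank_tree(tree_b, query)])
ranked_ids = [eid for _, eid in merged]
```

Because each per-tree list is already sorted, `heapq.merge` combines them without re-sorting the whole set.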

ALGORITHM

EXPERIMENTS

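- Paragraph indexing: the result is a set of extended vectors, each representing a paragraph.
- CO queries: the subvectors represent the subjective portion; the body subvector is the important one, because what matters is the content of an element, not its type, and that content is contained in the body.
- Tree representation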

FACTORS OF INTEREST

- Slope and pivot for Lnu-ltu weighting: needed for effective structured retrieval. They can be determined empirically, and values found on one collection can be applied to another (generic).
- n, the number of paragraphs input: sets an upper bound on the number of trees per query.
- The actual number of trees depends on how many of those paragraphs belong to the same group or the same document.

EXPERIMENTS DONE

- All-element and dynamic (flexible) retrieval experiments and results are based on body-only retrieval.
- The correlation between element vector and query vector is computed on body elements only.
- Table 1

RESULTS

 Tables

- Results are equivalent.
- Flexible retrieval is more efficient in file space.
- Time required for indexing is half.
- Dynamic retrieval costs more on a per-query basis, depending on n; the total number of trees required is not specified exactly.
- Another factor: the value of nk.

DISCUSSIONS AND CONCLUSIONS

- Flexible retrieval dynamically produces a rank-ordered list of elements from a single indexing at the level of the basic indexing node (the paragraph).
- Basic functions are provided by SMART and the extended vector model.
- Results demonstrate the flexible capabilities.
- Further work: attempt to incorporate other subvectors and internal-node weights.
- INEX assesses exhaustivity and specificity; the results reflect exhaustivity, and research on specificity is ongoing.
- Flexible retrieval is a better approach to retrieval than all-element indexing.
